2025.iwslt-1.12@ACL


#1 Human-Evaluated Urdu-English Speech Corpus: Advancing Speech-to-Text for Low-Resource Languages

Authors: Humaira Mehmood, Sadaf Abdul Rauf

This paper presents our contribution to the IWSLT Low Resource Track 2: ‘Training and Evaluation Data Track’. We share a human-evaluated Urdu-English speech-to-text corpus based on the Common Voice 13.0 Urdu speech corpus. We followed a three-tier validation scheme: an initial automatic translation corrected by native reviewers, a full review by evaluators, and a final validation by a bilingual expert, ensuring a reliable corpus for subsequent NLP tasks. Our contribution, the CV-UrEnST corpus, enriches Urdu speech resources by providing the first Urdu-English speech-to-text corpus. When evaluated with Whisper-medium, the corpus yielded a significant improvement over the vanilla model in terms of BLEU, chrF++, and COMET scores, demonstrating its effectiveness for speech translation tasks.

Subject: IWSLT.2025