English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization


Authors: Mohammad Mohammadamini, Daban Q. Jaff, Josep Crego, Marie Tahon, Antoine Laurent

We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, with 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set held out from the TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve on the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).
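
As a rough illustration of what character-level orthographic standardization can look like, the sketch below maps a few Arabic code points that are commonly typed in place of their Central Kurdish counterparts onto a single canonical form. The mapping table and the `standardize` function are assumptions made for illustration; the abstract does not describe the authors' actual rules, which are likely broader.

```python
# Hypothetical mapping, not taken from the paper: Arabic code points that are
# often used interchangeably with the forms preferred in Central Kurdish text.
CHAR_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(CHAR_MAP)


def standardize(text: str) -> str:
    """Replace variant code points with one canonical form and collapse whitespace."""
    return " ".join(text.translate(_TABLE).split())
```

Applying the same function to both training targets and evaluation references would keep spelling variants from being counted as separate tokens when scoring with BLEU.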

Subject: Computation and Language

Publish: 2026-04-01 08:14:25 UTC

2604.00613
