🤖 AI Summary
This study addresses the poor performance and inconsistent output in Central Kurdish speech translation, primarily caused by the scarcity of high-quality parallel data and the lack of orthographic standardization. To tackle these challenges, we present KUTED, the first large-scale English–Central Kurdish speech translation corpus, comprising 91,000 sentence pairs, along with a systematic text normalization pipeline to unify orthographic conventions. Leveraging this resource, we train a Transformer model from scratch and evaluate both the Seamless end-to-end system and a cascaded architecture combining ASR and NLLB components. Experimental results demonstrate that orthographic normalization yields a 3.0 BLEU improvement for Seamless on the FLEURS benchmark, and that a fine-tuned Seamless model achieves 15.18 BLEU on a held-out TED test set, significantly advancing speech translation for this low-resource language.
📝 Abstract
We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, spanning 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set held out from the TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).