English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization

📅 2026-04-01
🤖 AI Summary
This study addresses poor performance and inconsistent output in English-to-Central Kurdish speech translation, which it attributes to scarce high-quality parallel data and the lack of orthographic standardization. It presents KUTED, an English–Central Kurdish speech translation corpus of 91,000 sentence pairs derived from TED and TEDx talks, along with a systematic text standardization approach that unifies orthographic conventions. Using this resource, the authors fine-tune the Seamless end-to-end model, train a Transformer model from scratch, and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT). The fine-tuned Seamless model reaches 15.18 BLEU on a test set separated from TED talks, and the Seamless baseline improves by 3.0 BLEU on the FLEURS benchmark.
📝 Abstract
We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, including 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance and leads to nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set separated from TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).
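The abstract does not spell out the standardization rules, so the following is only a minimal sketch of what character-level orthographic normalization for Central Kurdish text might look like. The mapping tables (`CHAR_MAP`, `SEQ_MAP`) and the `normalize` function are hypothetical names chosen for illustration, and the specific substitutions are assumptions based on commonly observed variation in Arabic-script Kurdish text (Arabic-keyboard kaf and yeh in place of their Kurdish counterparts, and heh followed by a zero-width non-joiner in place of the letter ae). They are not the paper's actual rule set.

```python
import unicodedata

# Hypothetical single-character substitutions (assumed, not taken from the paper).
CHAR_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
}

# Hypothetical multi-character substitutions, applied before CHAR_MAP.
SEQ_MAP = {
    "\u0647\u200C": "\u06D5",  # HEH + ZERO WIDTH NON-JOINER -> ARABIC LETTER AE
}

_TRANSLATION_TABLE = str.maketrans(CHAR_MAP)


def normalize(text: str) -> str:
    """Map variant spellings onto one canonical form (illustrative only)."""
    # Canonical Unicode composition first, so equivalent code point
    # sequences compare equal before the custom rules run.
    text = unicodedata.normalize("NFC", text)
    for source, target in SEQ_MAP.items():
        text = text.replace(source, target)
    text = text.translate(_TRANSLATION_TABLE)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


# The same word typed with Arabic-keyboard letters and with standard letters.
raw = "\u0643\u0648\u0631\u062F\u064A"
reference = "\u06A9\u0648\u0631\u062F\u06CC"

print(raw == reference)             # the two spellings differ as strings
print(normalize(raw) == reference)  # they agree after normalization
```

The comparison at the end hints at why orthographic variation can depress string-matching metrics such as BLEU: two spellings that look nearly identical to a reader are different tokens to the metric until both the training targets and the references are mapped to one form.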
Problem

Research questions and friction points this paper is trying to address.

speech translation
Central Kurdish
orthographic variation
nonstandard output
translation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

speech-to-text translation
orthographic standardization
Central Kurdish
corpus creation
Seamless model
Mohammad Mohammadamini
LIUM, Le Mans University, Le Mans, France
Daban Q. Jaff
Erfurt University, Erfurt, Germany; Koya University, Koysinjaq, Iraq
Josep Crego
SYSTRAN (ChapsVision), Paris, France
Marie Tahon
LIUM, Le Mans University, Le Mans, France
Antoine Laurent
LIUM, Le Mans Université
Automatic Speech Recognition, Machine Learning, Deep Learning