🤖 AI Summary
This study addresses the poor performance and inconsistent output in Central Kurdish speech translation, primarily caused by the scarcity of high-quality parallel data and the lack of orthographic standardization. To tackle these challenges, we present KUTED, the first large-scale English–Central Kurdish speech translation corpus, comprising 91,000 sentence pairs, along with a systematic text normalization pipeline to unify orthographic conventions. Leveraging this resource, we train a Transformer model from scratch and evaluate both the Seamless end-to-end system and a cascaded architecture combining ASR and NLLB components. Experimental results demonstrate that orthographic normalization yields a 3.0 BLEU improvement for Seamless on the FLEURS benchmark, and that a fine-tuned Seamless model achieves 15.18 BLEU on a held-out TED test set, significantly advancing speech translation for this low-resource language.
📝 Abstract
We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, spanning 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set held out from the TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).