🤖 AI Summary
Research on code-switching (CS) speech recognition and translation is held back by a scarcity of high-quality multilingual datasets, particularly ones that cover low-resource languages and include both real and synthetic CS speech. To address this, the authors introduce CS-FLEURS, a large-scale multilingual CS speech dataset covering 52 languages and 113 CS language pairs. The methodology combines generative TTS, concatenative synthesis, and rule-based sentence generation to reduce reliance on high-resource languages. The dataset comprises 128 hours of training speech and four test sets, which span real voices reading synthetically generated CS sentences as well as fully synthetic speech. Empirical evaluation shows improved modeling capacity and cross-lingual generalization on low-resource CS tasks. The resource is intended as infrastructure for fair, scalable, and robust multilingual speech understanding research.
📝 Abstract
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of four test sets that together cover 113 unique code-switched language pairs across 52 languages: 1) a set of 14 X-English language pairs with real voices reading synthetically generated code-switched sentences; 2) a set of 16 X-English language pairs with generative text-to-speech; 3) a set of 60 {Arabic, Mandarin, Hindi, Spanish}-X language pairs with generative text-to-speech; and 4) a set of 45 X-English lower-resourced language pairs with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.
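The per-set pair counts in the abstract add up to more than the 113 unique pairs it reports, so some language pairs must appear in more than one test set. The short Python sketch below tallies the figures quoted above to make that arithmetic explicit. The `EvalSet` class, its field names, and the set labels are hypothetical names chosen for illustration; they are not the dataset's actual configuration or split names, and the abstract does not say which pairs overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSet:
    """One CS-FLEURS test set, described only by figures quoted in the abstract."""
    label: str          # illustrative label, not an official split name
    num_pairs: int      # language pairs covered by this test set
    speech_source: str  # how the audio was produced


# Counts copied from the abstract; the labels are assumptions for readability.
TEST_SETS = [
    EvalSet("X-English (read)", 14, "real voices reading synthetic text"),
    EvalSet("X-English (generative)", 16, "generative text-to-speech"),
    EvalSet("{Arabic, Mandarin, Hindi, Spanish}-X", 60, "generative text-to-speech"),
    EvalSet("X-English (lower-resourced)", 45, "concatenative text-to-speech"),
]
UNIQUE_PAIRS = 113  # unique code-switched pairs reported across all test sets


def tally(test_sets, unique_pairs):
    """Return (sum of per-set pair counts, entries that repeat an already-counted pair)."""
    total_entries = sum(s.num_pairs for s in test_sets)
    return total_entries, total_entries - unique_pairs


total_entries, repeated_entries = tally(TEST_SETS, UNIQUE_PAIRS)
print(f"per-set pair entries: {total_entries}")
print(f"entries repeating a pair from another set: {repeated_entries}")
```

Summing the four sets gives 135 pair entries against 113 unique pairs, so 22 entries repeat a pair that is already covered by another test set.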