CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

📅 2025-09-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Research on code-switched (CS) speech recognition and translation is limited by the scarcity of datasets that reach beyond high-resource languages. CS-FLEURS addresses this gap with a multilingual CS speech dataset covering 52 languages and 113 unique CS language pairs. It draws on three sources of speech: real voices reading synthetically generated code-switched sentences, generative text-to-speech (TTS), and concatenative TTS. The dataset provides four test sets built from these sources, plus a 128-hour training set of generative TTS data across 16 X-English language pairs. The authors intend the resource to broaden the scope of future code-switched speech research, particularly for lower-resourced languages.

📝 Abstract

We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.
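The per-set pair counts in the abstract (14, 16, 60, 45) add up to more than the 113 unique pairs it reports, which suggests that some language pairs appear in more than one test set. The short Python sketch below works through that arithmetic. It assumes each stated count is the number of distinct pairs within its own set; the set labels and the `overlap_summary` helper are illustrative names, not part of the dataset.

```python
# Pair counts per test set, as stated in the abstract.
# The dictionary keys are informal labels chosen for this sketch.
TEST_SET_PAIR_COUNTS = {
    "read speech, X-English": 14,
    "generative TTS, X-English": 16,
    "generative TTS, {Arabic, Mandarin, Hindi, Spanish}-X": 60,
    "concatenative TTS, X-English (lower-resourced)": 45,
}
UNIQUE_PAIRS = 113  # unique pairs across all four sets, per the abstract


def overlap_summary(set_sizes, unique_total):
    """Return (sum of per-set sizes, memberships beyond the unique total)."""
    total = sum(set_sizes.values())
    return total, total - unique_total


total_memberships, repeated_memberships = overlap_summary(
    TEST_SET_PAIR_COUNTS, UNIQUE_PAIRS
)
print(f"sum of per-set pair counts: {total_memberships}")
print(f"memberships beyond the {UNIQUE_PAIRS} unique pairs: {repeated_memberships}")
```

Under that assumption the surplus counts repeated set memberships, not distinct pairs: a pair that appears in three sets contributes two to the surplus. The abstract does not say which pairs overlap.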

Problem

Research questions and friction points this paper is trying to address.

Developing code-switched speech recognition systems for multilingual scenarios

Evaluating translation models for low-resource code-switched language pairs

Expanding code-switched speech research beyond high-resourced languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual code-switched speech dataset

Generative and concatenative TTS methods

Covers 113 unique code-switched language pairs across 52 languages

🔎 Similar Papers

Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data