CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Code-switched (CS) speech recognition and translation research has been limited by a scarcity of evaluation data beyond high-resourced languages. To address this, the paper introduces CS-FLEURS, a large-scale multilingual CS speech dataset covering 52 languages and 113 CS language pairs. Its construction combines generative text-to-speech, concatenative synthesis, and real voices reading synthetically generated CS sentences, reducing reliance on high-resource languages. The dataset comprises 128 hours of training speech and four diverse test sets spanning both synthetic and read CS speech. The authors position CS-FLEURS as infrastructure for broadening code-switched speech research toward lower-resourced languages.

📝 Abstract
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.
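As a quick sanity check on the abstract's figures: the four test sets list 14 + 16 + 60 + 45 = 135 language-pair slots, while the abstract counts 113 unique pairs, so 22 listings (counted with multiplicity) are pairs that also appear in another set. A minimal Python tally, using only the counts stated in the abstract:

```python
# Test-set sizes as stated in the CS-FLEURS abstract (pairs per set).
test_sets = {
    "real voices, synthetic CS sentences (X-English)": 14,
    "generative TTS (X-English)": 16,
    "generative TTS ({Arabic, Mandarin, Hindi, Spanish}-X)": 60,
    "concatenative TTS, lower-resourced (X-English)": 45,
}

total_entries = sum(test_sets.values())  # 135 pair slots across the 4 sets
unique_pairs = 113                       # unique pairs per the abstract
overlap = total_entries - unique_pairs   # repeat listings, with multiplicity

print(total_entries, unique_pairs, overlap)  # prints: 135 113 22
```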

Problem

Research questions and friction points this paper is trying to address.

Developing code-switched speech recognition systems for multilingual scenarios
Evaluating translation models for low-resource code-switched language pairs
Expanding code-switched speech research beyond high-resourced languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual code-switched speech dataset
Synthetic and generative TTS methods
Covers 113 language pairs
Authors

Brian Yan
Carnegie Mellon University

Injy Hamed
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Code-switching, NLP, Speech Recognition, Machine Translation

Shuichiro Shimizu
Ph.D. student, Kyoto University
natural language processing, speech processing

Vasista Lodagala
Humain

William Chen
Carnegie Mellon University
Spoken Language Processing, Speech Recognition, Speech Translation, Machine Translation

Olga Iakovenko
University of Sheffield
artificial intelligence, natural language processing, speech recognition, neural networks

Bashar Talafha
University of British Columbia
Artificial Intelligence, Machine Learning, Deep Learning, Natural Language Processing, Algorithms

Amir Hussein
Graduate Research Assistant, Johns Hopkins University
Speech Processing, Speech Translation, Transfer Learning

Alexander Polok
Brno University of Technology, Faculty of Information Technology
Machine learning

Kalvin Chang
Carnegie Mellon University

Dominik Klement
Brno University of Technology
Automatic Speech Recognition, Speaker Diarization, Machine Learning

Sara Althubaiti
Humain

Puyuan Peng
University of Texas at Austin

Matthew Wiesner
Research Scientist, Johns Hopkins University
Speech Recognition

Thamar Solorio
MBZUAI & University of Houston
Natural Language Processing

Ahmed Ali
Humain

Sanjeev Khudanpur
The Johns Hopkins University
Human Language Technology, Statistical Modeling, Information Theory

Shinji Watanabe
Carnegie Mellon University
Speech recognition, Speech processing, Speech enhancement, Speech translation

Chih-Chen Chen
Carnegie Mellon University

Zhen Wu
Carnegie Mellon University

Karim Benharrak
UT Austin
Human-Computer Interaction, Human-AI Collaboration, Human-AI Co-Creativity

Anuj Diwan
University of Texas at Austin

Samuele Cornell
Carnegie Mellon University, Language Technologies Institute
Speech Processing, Machine Learning

Eunjung Yeo
University of Texas at Austin

Kwanghee Choi
University of Texas at Austin
Speech, Machine Learning, Computational Linguistics