CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching

📅 2025-10-09
🤖 AI Summary
Existing speech-to-speech (S2S) models suffer from inadequate cross-lingual alignment in Mandarin–English code-switching (CS) scenarios, resulting in low accuracy on knowledge-intensive question answering and frequent misunderstandings in open-ended conversations. To address this, we introduce CS3-Bench, the first benchmark designed specifically for code-switching speech-to-speech evaluation, and use it to assess mainstream S2S models on bilingual switching tasks. We propose Chain of Recognition (CoR) to enhance cross-lingual semantic understanding and Keyword Highlighting (KH) to explicitly guide bilingual speech generation, and we jointly optimize data construction and training strategies to improve both CS recognition and synthesis. Experiments demonstrate substantial improvements: knowledge QA accuracy rises from 25.14% to 46.13%, open-ended understanding improves from 64.5% to 86.5%, and pronunciation errors in the secondary language are significantly reduced.

📝 Abstract
The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find that existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data construction and training approaches to improve language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves knowledge accuracy from 25.14% to 46.13% and the open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating speech-to-speech models for Mandarin-English code-switching deficiencies
Addressing performance drops in knowledge-intensive multilingual question answering
Improving language alignment and reducing secondary language pronunciation errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain of Recognition enhances multilingual understanding
Keyword Highlighting guides cross-lingual speech generation
Data construction improves language alignment capabilities
Authors

Heyang Liu — Shanghai Jiao Tong University
Yuhao Wang — Shanghai Jiao Tong University
Ziyang Cheng — University of Electronic Science and Technology of China
Ronghua Wu — Ant Group
Qunshan Gu — Ant Group
Yanfeng Wang — Shanghai Jiao Tong University
Yu Wang — Shanghai Jiao Tong University