Improving Code-Switching Speech Recognition with TTS Data Augmentation

📅 2025-10-22
🏛️ Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck in automatic speech recognition (ASR) for conversational English–Mandarin code-switching speech, primarily caused by the scarcity of authentic labeled data. To overcome this limitation, the authors propose leveraging a fine-tuned multilingual text-to-speech (TTS) model, CosyVoice2, to generate highly realistic synthetic code-switching utterances that mimic natural conversational patterns. This approach substantially expands both the volume and speaker diversity of the training data. By integrating the synthesized data with real recordings to train the ASR system, the method achieves significant improvements in recognition accuracy under low-resource conditions, reducing the mixed error rate from 12.1% to 10.1% on the DevMan test set and from 17.8% to 16.0% on the DevSGE test set.

📝 Abstract
Automatic speech recognition (ASR) for conversational code-switching speech remains challenging due to the scarcity of realistic, high-quality labeled speech data. This paper explores multilingual text-to-speech (TTS) models as an effective data augmentation technique to address this shortage. Specifically, we fine-tune the multilingual CosyVoice2 TTS model on the SEAME dataset to generate synthetic conversational Chinese-English code-switching speech, significantly increasing the quantity and speaker diversity of available training data. Our experiments demonstrate that augmenting real speech with synthetic speech reduces the mixed error rate (MER) from 12.1% to 10.1% on DevMan and from 17.8% to 16.0% on DevSGE. These results confirm that multilingual TTS is an effective and practical tool for enhancing ASR robustness in low-resource, conversational code-switching scenarios.
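
The abstract reports results as mixed error rate (MER) but does not describe the scoring script. The sketch below shows one way such a metric could be computed, assuming the common convention for Mandarin-English data that each Mandarin character and each English word counts as one token. The function names, the tokenization regex, and the example sentences are illustrative assumptions, not the authors' implementation.

```python
import re

# Assumption: one token per CJK ideograph, one token per run of Latin letters.
# Digits and punctuation are ignored, which is a simplification.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def mixed_tokens(text):
    """Split a transcript into Mandarin characters and lowercased English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def mixed_error_rate(refs, hyps):
    """Total edits divided by total reference tokens over a set of utterances."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = mixed_tokens(ref)
        errors += edit_distance(ref_tokens, mixed_tokens(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference transcripts contain no scorable tokens")
    return errors / total


def relative_reduction(baseline, improved):
    """Relative error reduction, e.g. for comparing two MER values."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical utterance pair: six reference tokens, one substitution.
    print(mixed_error_rate(["我想 order 一杯 coffee"], ["我想 order 一杯 copy"]))
    print(relative_reduction(12.1, 10.1), relative_reduction(17.8, 16.0))
```

Applied to the figures in the abstract, the absolute reductions are 2.0 points on DevMan and 1.8 points on DevSGE, which correspond to relative reductions of roughly 16.5% and 10.1%.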
Problem

Research questions and friction points this paper is trying to address.

code-switching
automatic speech recognition
data scarcity
low-resource
conversational speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

code-switching
speech recognition
TTS data augmentation
multilingual TTS
low-resource ASR
Yue Heng Yeo
Institute for Infocomm Research (I2R), A*STAR, Singapore
Yuchen Hu
Nanyang Technological University
Speech, LLM, Multimodal
Shreyas Gopal
College of Computing and Data Science, Nanyang Technological University, Singapore
Yizhou Peng
College of Computing and Data Science, Nanyang Technological University, Singapore
Hexin Liu
Nanyang Technological University
Speech recognition, language identification
E. Chng
College of Computing and Data Science, Nanyang Technological University, Singapore