Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of bilingual speech data scarcity in code-switching (CS) speech synthesis by proposing an LLM-based speech modeling paradigm that relies solely on monolingual corpora. Methodologically, the authors introduce a MultiLingual MultiTask (MLMT) large language model that unifies speech generation and recognition within a single architecture; design a cross-lingual word-level segmentation-and-recombination strategy to construct high-quality CS training data without supervision; and integrate EnCodec speech tokenization with multimodal prompt learning for end-to-end joint modeling. Key contributions include: (1) a unified LLM framework jointly modeling speech generation and recognition, and (2) a novel data construction paradigm that eliminates reliance on authentic CS speech pairs. Experiments demonstrate that the approach significantly outperforms baselines, including VALL-E and Qwen-Audio, at comparable data scales, yielding substantial improvements in multilingual speech naturalness, speaker consistency, and automatic speech recognition accuracy.

📝 Abstract
While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines at a comparable data scale. Furthermore, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing code-switched text-to-speech synthesis in LLMs
Using only monolingual corpora for multilingual speech generation
Improving naturalness and speaker consistency in CS TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual speech recognition and synthesis training
Code-switched data construction from monolingual corpora
Splitting and concatenating words across languages
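The split-and-concatenate idea above can be sketched in a few lines. This is a hypothetical text-side illustration only (the paper also splices the corresponding speech segments, and its exact segmentation strategy is not detailed here); the function name, `switch_prob` parameter, and example sentences are assumptions, not the authors' implementation:

```python
import random

def build_cs_sentence(primary_words, secondary_words, switch_prob=0.3, seed=0):
    """Splice words from a secondary-language corpus into a
    primary-language sentence to form synthetic code-switched text.
    With switch_prob=0.0 the sentence is returned unchanged."""
    rng = random.Random(seed)  # seeded for reproducibility
    mixed = []
    for word in primary_words:
        if secondary_words and rng.random() < switch_prob:
            mixed.append(rng.choice(secondary_words))  # switch languages
        else:
            mixed.append(word)  # keep primary-language word
    return mixed

# Toy example: mix Mandarin words into an English sentence.
en = ["i", "really", "like", "this", "movie"]
zh = ["非常", "喜欢", "电影"]
print(" ".join(build_cs_sentence(en, zh, switch_prob=0.4, seed=1)))
```

In a full pipeline, each replaced word would bring along its aligned audio segment, so the concatenated waveform and the mixed transcript stay paired without any authentic CS recordings.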
Jing Xu
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR, China
Daxin Tan
Noah’s Ark Lab, Huawei, Hong Kong SAR, China
Jiaqi Wang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
Xiao Chen
Noah’s Ark Lab, Huawei, Hong Kong SAR, China