Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of bilingual speech data scarcity in code-switching (CS) speech synthesis by proposing an LLM-based speech modeling paradigm that relies solely on monolingual corpora. Methodologically, the authors introduce a MultiLingual MultiTask (MLMT) large language model that unifies speech generation and recognition within a single architecture; design a cross-lingual word-level segmentation-and-recombination strategy to construct high-quality CS training data without supervision; and integrate EnCodec speech tokenization with multimodal prompt learning for end-to-end joint modeling. Key contributions include: (1) a unified LLM framework jointly modeling speech generation and recognition, and (2) a novel data construction paradigm that eliminates reliance on authentic CS speech pairs. Experiments demonstrate that the approach significantly outperforms baselines, including VALL-E and Qwen-Audio, at comparable data scales, yielding substantial improvements in multilingual speech naturalness, speaker consistency, and automatic speech recognition accuracy.

📝 Abstract
While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines at a comparable data scale. Furthermore, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing code-switched text-to-speech synthesis in LLMs
Using only monolingual corpora for multilingual speech generation
Improving naturalness and speaker consistency in CS TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual speech recognition and synthesis training
Code-switched data construction from monolingual corpora
Splitting and concatenating words across languages
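The split-and-concatenate idea above can be sketched in a few lines. This is a hypothetical text-side illustration only (the paper also splices the corresponding speech segments, and its exact segmentation strategy is not detailed here); the function name, `switch_prob` parameter, and example sentences are assumptions, not the authors' implementation:

```python
import random

def build_cs_sentence(primary_words, secondary_words, switch_prob=0.3, seed=0):
    """Splice words from a secondary-language corpus into a
    primary-language sentence to form synthetic code-switched text.
    With switch_prob=0.0 the sentence is returned unchanged."""
    rng = random.Random(seed)  # seeded for reproducibility
    mixed = []
    for word in primary_words:
        if secondary_words and rng.random() < switch_prob:
            mixed.append(rng.choice(secondary_words))  # switch languages
        else:
            mixed.append(word)  # keep primary-language word
    return mixed

# Toy example: mix Mandarin words into an English sentence.
en = ["i", "really", "like", "this", "movie"]
zh = ["非常", "喜欢", "电影"]
print(" ".join(build_cs_sentence(en, zh, switch_prob=0.4, seed=1)))
```

In a full pipeline, each replaced word would bring along its aligned audio segment, so the concatenated waveform and the mixed transcript stay paired without any authentic CS recordings.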
Jing Xu
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR, China
Daxin Tan
Noah’s Ark Lab, Huawei, Hong Kong SAR, China
Jiaqi Wang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
Xiao Chen
Noah’s Ark Lab, Huawei, Hong Kong SAR, China