🤖 AI Summary
To address the poor generalization of descriptive text-to-speech (TTS) systems on out-of-domain user descriptions, this paper proposes a modality-based Mixture of Experts (MoE) architecture. Without fine-tuning the frozen pre-trained large language model (LLM), the method introduces trainable speech-modality expert networks and employs a gating mechanism to dynamically fuse their outputs with the frozen text pathway. This enables efficient adaptation of pre-trained textual knowledge to speech generation while preserving linguistic capability. Experiments on a custom out-of-domain test set demonstrate significant improvements in semantic alignment between generated speech and input descriptions, as well as enhanced naturalness, particularly for complex or unseen descriptions. Under these challenging conditions, the proposed approach also outperforms closed-source commercial TTS systems. These results validate the effectiveness of modality-decoupled MoE for cross-modal generalization in descriptive TTS.
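The core mechanism, fusing a frozen text expert with a trainable speech-modality expert via a gate, can be illustrated with a minimal sketch. This is not the paper's released implementation; all function names, the scalar-gate formulation, and the toy linear experts are assumptions made purely for illustration.

```python
import math

def linear(x, w):
    # y = W x, with the weight matrix stored as a list of rows
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def modality_moe_layer(x, w_text, w_speech, w_gate):
    """One modality-based MoE step (illustrative sketch, not the paper's code).

    w_text   -- frozen weights from the pre-trained textual LLM
    w_speech -- trainable speech-modality expert weights
    w_gate   -- gating weights producing a per-token mixing score
    """
    text_out = linear(x, w_text)      # frozen text expert path (never updated)
    speech_out = linear(x, w_speech)  # trainable speech expert path
    # sigmoid gate in (0, 1) decides how much each expert contributes
    g = 1.0 / (1.0 + math.exp(-sum(gi * xi for gi, xi in zip(w_gate, x))))
    return [g * s + (1.0 - g) * t for s, t in zip(speech_out, text_out)]
```

Because only `w_speech` and `w_gate` would receive gradients, the textual LLM's pre-trained knowledge is left untouched, which is the property the paper relies on for out-of-domain text understanding.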
📝 Abstract
Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. However, in real-world applications, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality, while keeping the original LLM frozen during training. This design allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions. We encourage readers to listen to the demos at https://welkinyang.github.io/MoE-TTS/.