🤖 AI Summary
To address the poor generalization of descriptive text-to-speech (TTS) systems on out-of-domain user descriptions, this paper proposes a modality-based Mixture of Experts (MoE) architecture. Without fine-tuning the frozen pre-trained large language model (LLM), the method introduces trainable speech-modality expert networks and employs a gating mechanism to dynamically fuse their outputs with the frozen text pathway. This enables efficient adaptation of pre-trained textual knowledge to speech generation while preserving linguistic capability. Experiments on a custom out-of-domain test set demonstrate significant improvements in semantic alignment between generated speech and input descriptions, as well as enhanced naturalness, particularly for complex or unseen descriptions. Under these challenging conditions, the proposed approach also outperforms closed-source commercial TTS systems. These results validate the effectiveness of modality-decoupled MoE for cross-modal generalization in descriptive TTS.
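The core mechanism, fusing a frozen text expert with a trainable speech-modality expert via a gate, can be illustrated with a minimal sketch. This is not the paper's released implementation; all function names, the scalar-gate formulation, and the toy linear experts are assumptions made purely for illustration.

```python
import math

def linear(x, w):
    # y = W x, with the weight matrix stored as a list of rows
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def modality_moe_layer(x, w_text, w_speech, w_gate):
    """One modality-based MoE step (illustrative sketch, not the paper's code).

    w_text   -- frozen weights from the pre-trained textual LLM
    w_speech -- trainable speech-modality expert weights
    w_gate   -- gating weights producing a per-token mixing score
    """
    text_out = linear(x, w_text)      # frozen text expert path (never updated)
    speech_out = linear(x, w_speech)  # trainable speech expert path
    # sigmoid gate in (0, 1) decides how much each expert contributes
    g = 1.0 / (1.0 + math.exp(-sum(gi * xi for gi, xi in zip(w_gate, x))))
    return [g * s + (1.0 - g) * t for s, t in zip(speech_out, text_out)]
```

Because only `w_speech` and `w_gate` would receive gradients, the textual LLM's pre-trained knowledge is left untouched, which is the property the paper relies on for out-of-domain text understanding.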
📝 Abstract
Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. However, in real-world applications, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality, while keeping the original LLM frozen during training. This design allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions. We encourage readers to listen to the demos at https://welkinyang.github.io/MoE-TTS/.