🤖 AI Summary
Existing TTS systems struggle to synthesize natural, coherent long-form multi-speaker dialogue, especially when modeling dialectal and paralinguistic diversity. To address this, we propose the first podcast-oriented, long-context, multi-turn, multi-speaker TTS framework. Our method integrates context-aware prosody prediction, dialect-adaptive prosody modeling, cross-lingual/dialectal transfer learning, and fine-grained paralinguistic feature control to improve speaker identity stability and dynamic intonation adaptation. The system supports Mandarin, English, and multiple Chinese dialects, and can synthesize coherent dialogue audio exceeding 90 minutes end to end. Experiments demonstrate state-of-the-art performance on both single-utterance TTS and multi-turn dialogue synthesis, with improvements in speech naturalness, speaker consistency, and contextual appropriateness over prior approaches.
📝 Abstract
Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks.
To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports Mandarin, English, and several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, with rhythm and intonation changing naturally as the dialogue progresses. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.