SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing TTS systems struggle to synthesize natural, coherent long-form multi-speaker dialogue, especially when modeling dialectal and paralinguistic diversity. To address this, the authors propose a podcast-oriented, long-context, multi-turn, multi-speaker TTS framework. The method integrates context-aware prosody prediction, dialect-adaptive prosody modeling, cross-lingual/dialectal transfer learning, and fine-grained paralinguistic control, improving speaker-identity stability and dynamic intonation adaptation. The system supports Mandarin, English, and multiple Chinese dialects, and can synthesize coherent dialogue audio exceeding 90 minutes end to end. Experiments show state-of-the-art performance on both single-utterance TTS and multi-turn dialogue synthesis, with clear gains in speech naturalness, speaker consistency, and contextual appropriateness over prior approaches.

📝 Abstract
Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
Problem

Research questions and friction points this paper is trying to address.

Generating coherent multi-speaker conversational speech for podcasts
Supporting dialectal and paralinguistic diversity in speech synthesis
Achieving natural prosody and stable timbre in long dialogues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multi-speaker conversational podcast speech
Integrates paralinguistic controls for Mandarin and English
Supports multiple Chinese dialects for personalized generation
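To make the multi-turn, multi-speaker setup above concrete, here is a minimal sketch of how a podcast script with per-turn speaker, dialect, and paralinguistic tags might be represented and flattened into a model prompt. All names (`Turn`, `to_prompt`, the tag syntax) are illustrative assumptions, not the SoulX-Podcast API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical data model for a multi-turn podcast script.
@dataclass
class Turn:
    speaker: str                          # speaker tag, e.g. "S1"
    text: str                             # utterance text
    dialect: str = "Mandarin"             # e.g. "Sichuanese", "Cantonese"
    paralinguistic: Optional[str] = None  # e.g. "laugh", "sigh"

def to_prompt(turns: List[Turn]) -> str:
    """Flatten turns into a tagged prompt string a dialogue TTS model could consume."""
    lines = []
    for t in turns:
        tag = f"<{t.paralinguistic}>" if t.paralinguistic else ""
        lines.append(f"[{t.speaker}|{t.dialect}] {tag}{t.text}")
    return "\n".join(lines)

script = [
    Turn("S1", "Welcome back to the show!", paralinguistic="laugh"),
    Turn("S2", "Thanks, happy to be here.", dialect="Sichuanese"),
]
print(to_prompt(script))
```

The point of the sketch is the input structure: each turn carries its own dialect and paralinguistic annotation, which is what lets a long-context model vary prosody per speaker while keeping timbre stable across turns.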
👥 Authors
Hanke Xie
Northwestern Polytechnical University
Audio Speech Synthesis
Haopeng Lin
Soul AI Lab, China
Wenxiao Cao
Soul AI Lab, China
Dake Guo
Northwestern Polytechnical University
Speech Processing · Speech Synthesis
Wenjie Tian
Northwestern Polytechnical University
Speech Generation
Jun Wu
Soul AI Lab, China
Hanlin Wen
Soul AI Lab, China
Ruixuan Shang
Soul AI Lab, China
Hongmei Liu
Soul AI Lab, China
Zhiqi Jiang
Soul AI Lab, China
Yuepeng Jiang
Northwestern Polytechnical University
Speech Processing · Speech Synthesis · Voice Conversion
Wenxi Chen
X-LANCE Lab, Shanghai Jiao Tong University, China
Ruiqi Yan
Shanghai Jiao Tong University
Deep Learning · Audio · Speech
Jiale Qian
Soul AI Lab, China
Yichao Yan
Soul AI Lab, China
Shunshun Yin
Soul AI Lab, China
Ming Tao
Soul AI Lab, China
Xie Chen
X-LANCE Lab, Shanghai Jiao Tong University, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
Speech Synthesis · Singing Voice Synthesis · Voice Conversion