SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing TTS systems struggle to synthesize natural, coherent long-form multi-speaker dialogue, especially when modeling dialectal and paralinguistic diversity. To address this, the authors propose a podcast-oriented, long-context, multi-turn, multi-speaker TTS framework. The method integrates context-aware prosody prediction, dialect-adaptive prosody modeling, cross-lingual/dialectal transfer learning, and fine-grained paralinguistic control, improving speaker-identity stability and dynamic intonation adaptation. The system supports Mandarin, English, and multiple Chinese dialects, and can synthesize coherent dialogue audio exceeding 90 minutes end to end. Experiments show state-of-the-art performance on both single-utterance TTS and multi-turn dialogue synthesis, with clear gains in speech naturalness, speaker consistency, and contextual appropriateness over prior approaches.

📝 Abstract
Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
Problem

Research questions and friction points this paper is trying to address.

Generating coherent multi-speaker conversational speech for podcasts
Supporting dialectal and paralinguistic diversity in speech synthesis
Achieving natural prosody and stable timbre in long dialogues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multi-speaker conversational podcast speech
Integrates paralinguistic controls for Mandarin and English
Supports multiple Chinese dialects for personalized generation
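To make the multi-turn, multi-speaker setup above concrete, here is a minimal sketch of how a podcast script with per-turn speaker, dialect, and paralinguistic tags might be represented and flattened into a model prompt. All names (`Turn`, `to_prompt`, the tag syntax) are illustrative assumptions, not the SoulX-Podcast API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical data model for a multi-turn podcast script.
@dataclass
class Turn:
    speaker: str                          # speaker tag, e.g. "S1"
    text: str                             # utterance text
    dialect: str = "Mandarin"             # e.g. "Sichuanese", "Cantonese"
    paralinguistic: Optional[str] = None  # e.g. "laugh", "sigh"

def to_prompt(turns: List[Turn]) -> str:
    """Flatten turns into a tagged prompt string a dialogue TTS model could consume."""
    lines = []
    for t in turns:
        tag = f"<{t.paralinguistic}>" if t.paralinguistic else ""
        lines.append(f"[{t.speaker}|{t.dialect}] {tag}{t.text}")
    return "\n".join(lines)

script = [
    Turn("S1", "Welcome back to the show!", paralinguistic="laugh"),
    Turn("S2", "Thanks, happy to be here.", dialect="Sichuanese"),
]
print(to_prompt(script))
```

The point of the sketch is the input structure: each turn carries its own dialect and paralinguistic annotation, which is what lets a long-context model vary prosody per speaker while keeping timbre stable across turns.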
👥 Authors
Hanke Xie
Northwestern Polytechnical University
Audio Speech Synthesis
Haopeng Lin
Soul AI Lab, China
Wenxiao Cao
Soul AI Lab, China
Dake Guo
Northwestern Polytechnical University
Speech Processing · Speech Synthesis
Wenjie Tian
Northwestern Polytechnical University
Speech Generation
Jun Wu
Soul AI Lab, China
Hanlin Wen
Soul AI Lab, China
Ruixuan Shang
Soul AI Lab, China
Hongmei Liu
Soul AI Lab, China
Zhiqi Jiang
Soul AI Lab, China
Yuepeng Jiang
Northwestern Polytechnical University
Speech Processing · Speech Synthesis · Voice Conversion
Wenxi Chen
X-LANCE Lab, Shanghai Jiao Tong University, China
Ruiqi Yan
Shanghai Jiao Tong University
Deep Learning · Audio · Speech
Jiale Qian
Soul AI Lab, China
Yichao Yan
Soul AI Lab, China
Shunshun Yin
Soul AI Lab, China
Ming Tao
Soul AI Lab, China
Xie Chen
X-LANCE Lab, Shanghai Jiao Tong University, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
Speech Synthesis · Singing Voice Synthesis · Voice Conversion