MOSS-TTSD: Text to Spoken Dialogue Generation

📅 2026-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of generating multi-speaker conversational speech with natural turn-taking, cross-turn acoustic consistency, and long-term stability, capabilities that current text-to-speech (TTS) systems lack. We propose the first end-to-end TTS model capable of synthesizing ultra-long, multi-party dialogues of up to 60 minutes with up to five speakers in a single inference pass, using only scripts annotated with explicit speaker labels and enhanced long-context modeling. The system also supports zero-shot voice cloning and multilingual dialogue generation. To evaluate performance without reliance on external speaker diarization tools, we introduce TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity. Experimental results demonstrate that our approach significantly outperforms both open-source and commercial baselines on subjective naturalness and objective consistency metrics.
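The summary mentions that the model consumes scripts annotated with explicit speaker labels. A minimal sketch of what such a speaker-tagged input might look like is shown below; the `[S1]`/`[S2]` tag syntax and the `build_script` helper are illustrative assumptions, not the model's documented input format.

```python
# Hypothetical illustration of a speaker-tagged dialogue script of the
# kind described in the summary. The "[S1]"/"[S2]" tag syntax is an
# assumption for illustration, not MOSS-TTSD's documented format.

def build_script(turns):
    """Join (speaker_id, text) turns into one tagged script string."""
    return " ".join(f"[S{spk}] {text}" for spk, text in turns)

turns = [
    (1, "Welcome back to the show."),
    (2, "Thanks, great to be here."),
    (1, "Let's dive right in."),
]
print(build_script(turns))
```

With explicit tags like these, turn boundaries and speaker identity are fully determined by the script, so no audio-side diarization is needed at synthesis time.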

📝 Abstract
Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
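The abstract describes TTSD-eval as measuring speaker attribution accuracy via forced alignment rather than diarization. The core of such a metric can be sketched as below; the segment representation and function name are illustrative assumptions, where each forced-aligned segment pairs the speaker the script assigned with the speaker identified in the synthesized audio.

```python
# Hedged sketch of a forced-alignment-based speaker attribution check,
# in the spirit of the TTSD-eval idea described in the abstract. The
# (reference, hypothesis) segment tuples are an illustrative assumption:
# each aligned segment carries the speaker the script assigned and the
# speaker realized in the synthesized audio.

def speaker_attribution_accuracy(segments):
    """Fraction of aligned segments whose realized speaker matches the script."""
    if not segments:
        return 0.0
    correct = sum(1 for ref, hyp in segments if ref == hyp)
    return correct / len(segments)

segments = [("S1", "S1"), ("S2", "S2"), ("S1", "S2"), ("S2", "S2")]
print(speaker_attribution_accuracy(segments))  # 0.75
```

Because segments come from aligning the known script against the audio, the reference labels are exact, which is what lets this style of evaluation avoid an external diarization step.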
Problem

Research questions and friction points this paper is trying to address.

spoken dialogue generation
turn-taking
acoustic consistency
long-form stability
dialogue context modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

spoken dialogue synthesis
long-context modeling
zero-shot voice cloning
multi-party conversation
objective evaluation framework
Yuqian Zhang
Donghua Yu
Zhengyuan Lin
Botian Jiang
Mingshu Chen
Yaozhou Jiang
Yiwei Zhao
Yiyang Zhang
Yucheng Yuan
Hanfu Chen
Kexin Huang
Jun Zhan
Cheng Chang
Zhaoye Fei (Fudan University, Natural Language Processing)
Shimin Li (Fudan University, Large Language Model / Speech Language Model)
Xiaogui Yang
Qinyuan Cheng
Xipeng Qiu