MOSS-TTSD: Text to Spoken Dialogue Generation

📅 2026-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of generating multi-speaker conversational speech with natural turn-taking, cross-turn acoustic consistency, and long-term stability, capabilities that current text-to-speech (TTS) systems lack. We propose the first end-to-end TTS model capable of synthesizing ultra-long, multi-party dialogues of up to 60 minutes with up to five speakers in a single inference pass, using only scripts annotated with explicit speaker labels and enhanced long-context modeling. The system also supports zero-shot voice cloning and multilingual dialogue generation. To evaluate performance without reliance on external speaker diarization tools, we introduce TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity. Experimental results demonstrate that our approach significantly outperforms both open-source and commercial baselines on subjective naturalness and objective consistency metrics.
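The summary mentions that the model consumes scripts annotated with explicit speaker labels. A minimal sketch of what such a speaker-tagged input might look like is shown below; the `[S1]`/`[S2]` tag syntax and the `build_script` helper are illustrative assumptions, not the model's documented input format.

```python
# Hypothetical illustration of a speaker-tagged dialogue script of the
# kind described in the summary. The "[S1]"/"[S2]" tag syntax is an
# assumption for illustration, not MOSS-TTSD's documented format.

def build_script(turns):
    """Join (speaker_id, text) turns into one tagged script string."""
    return " ".join(f"[S{spk}] {text}" for spk, text in turns)

turns = [
    (1, "Welcome back to the show."),
    (2, "Thanks, great to be here."),
    (1, "Let's dive right in."),
]
print(build_script(turns))
```

With explicit tags like these, turn boundaries and speaker identity are fully determined by the script, so no audio-side diarization is needed at synthesis time.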

📝 Abstract
Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
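The abstract describes TTSD-eval as measuring speaker attribution accuracy via forced alignment rather than diarization. The core of such a metric can be sketched as below; the segment representation and function name are illustrative assumptions, where each forced-aligned segment pairs the speaker the script assigned with the speaker identified in the synthesized audio.

```python
# Hedged sketch of a forced-alignment-based speaker attribution check,
# in the spirit of the TTSD-eval idea described in the abstract. The
# (reference, hypothesis) segment tuples are an illustrative assumption:
# each aligned segment carries the speaker the script assigned and the
# speaker realized in the synthesized audio.

def speaker_attribution_accuracy(segments):
    """Fraction of aligned segments whose realized speaker matches the script."""
    if not segments:
        return 0.0
    correct = sum(1 for ref, hyp in segments if ref == hyp)
    return correct / len(segments)

segments = [("S1", "S1"), ("S2", "S2"), ("S1", "S2"), ("S2", "S2")]
print(speaker_attribution_accuracy(segments))  # 0.75
```

Because segments come from aligning the known script against the audio, the reference labels are exact, which is what lets this style of evaluation avoid an external diarization step.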
Problem

Research questions and friction points this paper is trying to address.

spoken dialogue generation
turn-taking
acoustic consistency
long-form stability
dialogue context modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

spoken dialogue synthesis
long-context modeling
zero-shot voice cloning
multi-party conversation
objective evaluation framework
Yuqian Zhang
Donghua Yu
Zhengyuan Lin
Botian Jiang
Mingshu Chen
Yaozhou Jiang
Yiwei Zhao
Yiyang Zhang
Yucheng Yuan
Hanfu Chen
Kexin Huang
Jun Zhan
Cheng Chang
Zhaoye Fei (Fudan University, Natural Language Processing)
Shimin Li (Fudan University, Large Language Model / Speech Language Model)
Xiaogui Yang
Qinyuan Cheng
Xipeng Qiu