🤖 AI Summary
To address slow inference, unstable speaker turn-taking, and the lack of large-scale open-source datasets in spoken dialogue generation, this paper proposes the first non-autoregressive zero-shot spoken dialogue generation framework. Methodologically, it introduces speaker-turn embeddings, a curriculum learning strategy, and dedicated strategies for stereo end-to-end synthesis, leveraging flow matching to jointly model speech-text alignment, zero-shot speaker adaptation, and high-fidelity speech generation. The authors construct OpenDialog, a 6.8k-hour open-source spoken dialogue dataset, and establish a comprehensive benchmark evaluating intelligibility, speaker similarity, turn-taking accuracy, and inference efficiency. Experiments demonstrate that the approach achieves substantially faster inference while maintaining speech quality, outperforming state-of-the-art autoregressive methods across all metrics. The code, models, and dataset are fully open-sourced.
📝 Abstract
Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being autoregressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset built from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our code, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at https://github.com/k2-fsa/ZipVoice.
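To make the flow-matching foundation concrete, here is a minimal sketch of the standard conditional flow-matching training objective that underlies non-autoregressive generators of this kind: the network learns the velocity of a straight-line path from noise to data. The toy `model` interface, shapes, and function name are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def flow_matching_loss(model, x1, cond, rng):
    """MSE between the predicted and true velocity along a linear noise-to-data path.

    x1:   clean target features, shape (batch, time, dim)
    cond: conditioning (e.g., text tokens, speaker-turn embeddings); opaque here
    """
    x0 = rng.standard_normal(x1.shape)     # noise endpoint of the path
    t = rng.random((x1.shape[0], 1, 1))    # random time in [0, 1) per example
    xt = (1 - t) * x0 + t * x1             # point on the straight-line path
    target = x1 - x0                       # constant velocity of that path
    pred = model(xt, t, cond)              # network predicts the velocity field
    return float(np.mean((pred - target) ** 2))
```

At inference, sampling integrates the learned velocity field from noise to data in a small number of steps, which is what allows a single forward pass per step instead of token-by-token autoregressive decoding.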