🤖 AI Summary
To address slow inference, unstable speaker turn-taking, and the lack of large-scale open-source datasets in spoken dialogue generation, this paper proposes the first non-autoregressive zero-shot spoken dialogue generation framework. Methodologically, it introduces speaker-turn embeddings, a curriculum learning strategy, and dedicated strategies for stereo end-to-end synthesis, leveraging flow matching to jointly model speech-text alignment, zero-shot speaker adaptation, and high-fidelity speech generation. The authors construct OpenDialog, a 6.8k-hour open-source spoken dialogue dataset, and establish a comprehensive benchmark evaluating intelligibility, speaker similarity, turn-taking accuracy, and inference efficiency. Experiments demonstrate that the approach achieves substantially faster inference while maintaining speech quality, outperforming state-of-the-art autoregressive methods across all metrics. The code, models, and dataset are fully open-sourced.
📝 Abstract
Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being autoregressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset built from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our code, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at https://github.com/k2-fsa/ZipVoice.
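To make the flow-matching foundation concrete, here is a minimal sketch of the standard conditional flow-matching training objective that underlies non-autoregressive generators of this kind: the network learns the velocity of a straight-line path from noise to data. The toy `model` interface, shapes, and function name are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def flow_matching_loss(model, x1, cond, rng):
    """MSE between the predicted and true velocity along a linear noise-to-data path.

    x1:   clean target features, shape (batch, time, dim)
    cond: conditioning (e.g., text tokens, speaker-turn embeddings); opaque here
    """
    x0 = rng.standard_normal(x1.shape)     # noise endpoint of the path
    t = rng.random((x1.shape[0], 1, 1))    # random time in [0, 1) per example
    xt = (1 - t) * x0 + t * x1             # point on the straight-line path
    target = x1 - x0                       # constant velocity of that path
    pred = model(xt, t, cond)              # network predicts the velocity field
    return float(np.mean((pred - target) ** 2))
```

At inference, sampling integrates the learned velocity field from noise to data in a small number of steps, which is what allows a single forward pass per step instead of token-by-token autoregressive decoding.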