ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

📅 2025-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address slow inference, unstable speaker turn-taking, and the lack of large-scale open-source datasets in spoken dialogue generation, this paper proposes ZipVoice-Dialog, presented as the first non-autoregressive zero-shot spoken dialogue generation framework. The model uses flow matching to jointly handle speech-text alignment, zero-shot speaker adaptation, and high-fidelity speech generation, and adds speaker-turn embeddings, a curriculum learning strategy, and support for stereo dialogue synthesis. The authors also build OpenDialog, a 6.8k-hour dataset described as the first large-scale open-source spoken dialogue dataset, and establish a benchmark covering intelligibility, speaker similarity, turn-taking accuracy, and inference efficiency. Experiments report several-fold faster inference with no loss in speech quality, and better results than prior methods on the benchmark metrics. The code, models, and dataset are open-sourced.

📝 Abstract
Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being autoregressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our code, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at https://github.com/k2-fsa/ZipVoice.
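
As a rough illustration of why flow-matching generation is non-autoregressive, the sketch below integrates a velocity field with a fixed number of Euler steps, updating every frame at once rather than one frame after another. It is a toy under stated assumptions, not the ZipVoice-Dialog implementation: `toy_velocity` and `euler_sample` are hypothetical names, and the velocity field is an analytic stand-in that already knows the target, whereas a real model would use a trained network conditioned on text and a speaker prompt.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a trained velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to x_1 is (x_1 - x_t) / (1 - t). A real model has to
    predict this from text and speaker conditioning; here the target is
    given directly (assumption for illustration only).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target, num_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise, one value per output frame.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        v = toy_velocity(x, t, target)
        # Every frame is updated in the same step: no left-to-right decoding.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


frames = euler_sample([0.5, -1.0, 2.0], num_steps=8)
```

The number of solver steps is fixed in advance and does not grow with the number of frames, which is the property that separates this style of sampling from token-by-token autoregressive decoding.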
Problem

Research questions and friction points this paper is trying to address.

Generate realistic spoken dialogue with distinct speaker timbres
Overcome slow and unstable inference in autoregressive models
Address lack of large-scale open-source dialogue datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-autoregressive flow matching for dialogue generation
Speaker-turn embeddings for accurate turn-taking
Curriculum learning for stable speech-text alignment
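
The speaker-turn embedding idea can be sketched as follows: each text token is tagged with the speaker who says it, and a per-speaker vector is added to that token's embedding so the model can tell the turns apart. This is a minimal sketch under assumptions, not the authors' code: the `[S1]`/`[S2]` script format, the helper names `parse_turns` and `add_turn_embeddings`, and the plain-list embedding table are all chosen for illustration, and in a real model the per-speaker vectors would be learned parameters.

```python
import re


def parse_turns(script):
    """Split a script like '[S1] hi [S2] hello' into (token, speaker_index) pairs.

    The '[S<n>]' tag format is an assumption made for this example.
    """
    pairs = []
    speaker = None
    # The capturing group keeps the speaker tags in the split result.
    for piece in re.split(r"(\[S\d+\])", script):
        piece = piece.strip()
        if not piece:
            continue
        tag = re.fullmatch(r"\[S(\d+)\]", piece)
        if tag:
            speaker = int(tag.group(1)) - 1  # zero-based speaker index
        elif speaker is None:
            raise ValueError("text found before the first speaker tag")
        else:
            pairs.extend((token, speaker) for token in piece.split())
    return pairs


def add_turn_embeddings(token_vectors, speaker_ids, turn_table):
    """Add the speaker's turn vector to each token vector, element-wise."""
    return [
        [a + b for a, b in zip(vec, turn_table[spk])]
        for vec, spk in zip(token_vectors, speaker_ids)
    ]


pairs = parse_turns("[S1] hello there [S2] hi [S1] bye")
speaker_ids = [spk for _, spk in pairs]
```

Because the speaker identity is attached to every token rather than only marked at turn boundaries, each position carries its own turn information regardless of how long a turn runs.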
Authors

Han Zhu
Xiaomi Corp., Beijing, China

Wei Kang
Xiaomi Corp., Beijing, China

Liyong Guo
Unknown affiliation

Zengwei Yao
Xiaomi Corp., Beijing, China

Fangjun Kuang
Xiaomi
Speech Recognition

Weiji Zhuang
Xiaomi Corp., Beijing, China

Zhaoqing Li
The Chinese University of Hong Kong
Speech Recognition, Machine Learning, Model Compression

Zhifeng Han
Xiaomi Corp., Beijing, China

Dong Zhang
Xiaomi Corp., Beijing, China

Xin Zhang
Xiaomi Corp., Beijing, China

Xingchen Song
Xiaomi Corp., Beijing, China

Long Lin
Georgia Institute of Technology
Energy Harvesting, Piezoelectric, Triboelectric, Self-Powered System

Daniel Povey
Chief Speech Scientist, Xiaomi Corp.
Speech Recognition