CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for multi-speaker conversational speech synthesis struggle to maintain speaker consistency, model overlapping speech, and generate coherent conversations efficiently, all at the same time. This paper introduces CoVoMix2, a zero-shot, fully non-autoregressive flow-matching framework that generates mel-spectrograms directly from multi-stream text transcriptions, without intermediate token representations. Key contributions include: (1) transcription-level speaker disentanglement to improve speaker consistency across utterances; (2) sentence-level alignment to control speech onset and offset; and (3) prompt-level random masking to improve zero-shot generalization and controllability. Experiments show that CoVoMix2 outperforms strong baselines, including MoonCast and Sesame, in speech quality, speaker consistency, and inference speed. It needs no transcription for the speaker prompt and supports controllable generation of overlapping speech with precise timing.

📝 Abstract
Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model, eliminating the reliance on intermediate token representations. To better capture realistic conversational dynamics, we propose transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking strategies. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed. Notably, CoVoMix2 operates without requiring transcriptions for the prompt and supports controllable dialogue generation, including overlapping speech and precise timing control, demonstrating strong generalizability to real-world speech generation scenarios.
Problem

Research questions and friction points this paper is trying to address.

Generating natural multi-speaker dialogue efficiently
Maintaining speaker consistency in conversations
Modeling overlapping speech and precise timing
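
To make "overlapping speech and precise timing" concrete, the minimal sketch below represents a dialogue as one utterance stream per speaker, with explicit onset and offset times, and finds the intervals where two speakers talk at once. The `Utterance` type, its field names, and the helpers `split_streams` and `overlaps` are hypothetical illustrations, not the paper's data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speaker: str   # hypothetical speaker label, e.g. "A"
    start: float   # onset in seconds
    end: float     # offset in seconds
    text: str


def split_streams(utterances):
    """Group utterances into one time-ordered stream per speaker."""
    streams = {}
    for u in sorted(utterances, key=lambda u: u.start):
        streams.setdefault(u.speaker, []).append(u)
    return streams


def overlaps(utterances):
    """Return (start, end, speaker_1, speaker_2) for cross-speaker overlaps."""
    found = []
    ordered = sorted(utterances, key=lambda u: u.start)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # everything after b starts even later, so no overlap with a
            if a.speaker != b.speaker:
                found.append((b.start, min(a.end, b.end), a.speaker, b.speaker))
    return found


dialogue = [
    Utterance("A", 0.0, 2.0, "How was the trip?"),
    Utterance("B", 1.5, 3.0, "It was great."),
    Utterance("A", 3.2, 4.0, "Good to hear."),
]
streams = split_streams(dialogue)
overlap_list = overlaps(dialogue)
```

Here speaker B starts 0.5 s before speaker A finishes, so the only overlap runs from 1.5 s to 2.0 s. Keeping each speaker in a separate stream is what lets such an overlap be expressed at all; a single interleaved transcript would force the turns into strict sequence.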
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully non-autoregressive flow matching framework
Direct mel-spectrogram prediction from transcriptions
Speaker disentanglement and alignment strategies
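
As a rough illustration of what "non-autoregressive flow matching" over mel-spectrograms means, the sketch below builds a conditional flow-matching training pair and samples with fixed-step Euler integration. It assumes the common linear path between noise and data. The summary does not specify the paper's network, conditioning, or path, so the function names, the toy sizes, and the oracle velocity field standing in for a trained model are all assumptions.

```python
import numpy as np


def cfm_training_pair(x0, x1, t):
    """Linear path x_t = (1 - t) * x0 + t * x1 and its target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0  # a model would regress onto this
    return x_t, target_velocity


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps.

    Every frame is updated in parallel at each step, so generation is
    non-autoregressive along the time axis.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


rng = np.random.default_rng(0)
frames, mel_bins = 50, 80                  # assumed toy sizes
x1 = rng.normal(size=(frames, mel_bins))   # stand-in for a target mel-spectrogram
x0 = rng.normal(size=(frames, mel_bins))   # Gaussian noise sample

x_half, v_target = cfm_training_pair(x0, x1, 0.5)

# Oracle velocity in place of a trained network (assumption for this demo).
oracle = lambda x, t: x1 - x0
generated = euler_sample(oracle, x0, num_steps=8)
```

With the oracle field, the Euler integration lands on `x1` after any number of steps because the velocity along a linear path is constant. A trained model only approximates this field, so in practice the number of steps trades speed against quality.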