🤖 AI Summary
This paper addresses the challenge of simultaneously achieving high audio quality, computational efficiency, and conversational authenticity in long-form multi-speaker text-to-speech synthesis. To this end, we propose a unified modeling framework based on next-token diffusion. Our key contributions are threefold: (1) a novel continuous speech tokenizer that achieves an 80× higher compression ratio than Encodec while preserving high-fidelity reconstruction and supporting low-latency inference; (2) an autoregressive latent-variable generation architecture with a 64K-token context window, enabling robust modeling of ultra-long sequences; and (3) an end-to-end natural dialogue speech synthesis system capable of generating conversations up to 90 minutes long with up to four speakers. Evaluated on both open and closed benchmarks, our method achieves state-of-the-art performance, significantly improving prosodic coherence, speaker distinguishability, and ambient realism in extended dialogues.
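A quick back-of-the-envelope check makes the headline numbers concrete: 90 minutes of speech represented within a 64K-token context implies an effective rate of roughly a dozen tokens per second. This sketch assumes, purely for illustration, that the entire window holds speech tokens; in practice some of it is occupied by text and speaker prompts.

```python
# Effective speech-token rate implied by the stated figures
# (90-minute conversation within a 64K-token context window).
# Assumption: the whole window is speech tokens; real usage
# also spends tokens on the text script and speaker prompts.
context_tokens = 64 * 1024      # "64K" context window
speech_seconds = 90 * 60        # 90 minutes of audio
tokens_per_second = context_tokens / speech_seconds
print(f"~{tokens_per_second:.1f} speech tokens per second")
```

At roughly 12 tokens per second, each continuous latent token must summarize a comparatively long stretch of audio, which is what the tokenizer's high compression ratio buys.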
📝 Abstract
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. As a result, VibeVoice can synthesize long-form speech of up to 90 minutes (within a 64K-token context window) with a maximum of four speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
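To make the next-token diffusion idea concrete, the loop below is a toy sketch of the pattern the abstract describes: an autoregressive backbone conditions on the latents generated so far, and a diffusion head denoises the next continuous latent from pure noise. Every function, dimension, and update rule here is a hypothetical stand-in for illustration only, not the paper's actual architecture.

```python
import random

# Toy sketch of next-token diffusion (illustrative only).
# lm_condition() stands in for the transformer backbone;
# denoise_step() stands in for one reverse-diffusion step
# of a lightweight diffusion head. Neither reflects VibeVoice's
# real components.

LATENT_DIM = 8  # hypothetical latent size

def lm_condition(history):
    # Map the latent history to a per-step conditioning vector.
    if not history:
        return [0.0] * LATENT_DIM
    n = len(history)
    return [sum(vec[i] for vec in history) / n for i in range(LATENT_DIM)]

def denoise_step(x, cond, t):
    # One reverse step: nudge x toward a (toy) prediction of the clean latent.
    predicted = [0.9 * xi + 0.1 * ci for xi, ci in zip(x, cond)]
    return [xi + (pi - xi) / t for xi, pi in zip(x, predicted)]

def generate(num_tokens, diffusion_steps=10, seed=0):
    rng = random.Random(seed)
    latents = []
    for _ in range(num_tokens):
        cond = lm_condition(latents)
        x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # start from noise
        for t in range(diffusion_steps, 0, -1):               # reverse process
            x = denoise_step(x, cond, t)
        latents.append(x)  # the next "token" is a continuous latent vector
    return latents

seq = generate(5)
print(len(seq), len(seq[0]))  # 5 8
```

The key contrast with discrete-token TTS is that each autoregressive step emits a continuous latent vector (later decoded to audio by the tokenizer's decoder) rather than an index into a codebook, which avoids quantization loss at the sequence-modeling stage.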