🤖 AI Summary
This paper addresses the challenge of simultaneously achieving high audio quality, computational efficiency, and conversational authenticity in long-form multi-speaker text-to-speech synthesis. To this end, we propose a unified modeling framework based on next-token diffusion. Our key contributions are threefold: (1) a novel continuous speech tokenizer that achieves an 80× higher compression ratio than Encodec while preserving high-fidelity reconstruction and supporting low-latency inference; (2) an autoregressive latent-variable generation architecture with a 64K-token context window, enabling robust modeling of ultra-long sequences; and (3) an end-to-end natural dialogue speech synthesis system capable of generating conversations up to 90 minutes long with up to four speakers. Evaluated on both open and closed benchmarks, our method achieves state-of-the-art performance, significantly improving prosodic coherence, speaker distinguishability, and ambient realism in extended dialogues.
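A quick back-of-the-envelope check makes the headline numbers concrete: 90 minutes of speech represented within a 64K-token context implies an effective rate of roughly a dozen tokens per second. This sketch assumes, purely for illustration, that the entire window holds speech tokens; in practice some of it is occupied by text and speaker prompts.

```python
# Effective speech-token rate implied by the stated figures
# (90-minute conversation within a 64K-token context window).
# Assumption: the whole window is speech tokens; real usage
# also spends tokens on the text script and speaker prompts.
context_tokens = 64 * 1024      # "64K" context window
speech_seconds = 90 * 60        # 90 minutes of audio
tokens_per_second = context_tokens / speech_seconds
print(f"~{tokens_per_second:.1f} speech tokens per second")
```

At roughly 12 tokens per second, each continuous latent token must summarize a comparatively long stretch of audio, which is what the tokenizer's high compression ratio buys.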
📝 Abstract
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. As a result, VibeVoice can synthesize long-form speech of up to 90 minutes (within a 64K-token context window) with a maximum of four speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
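To make the next-token diffusion idea concrete, the loop below is a toy sketch of the pattern the abstract describes: an autoregressive backbone conditions on the latents generated so far, and a diffusion head denoises the next continuous latent from pure noise. Every function, dimension, and update rule here is a hypothetical stand-in for illustration only, not the paper's actual architecture.

```python
import random

# Toy sketch of next-token diffusion (illustrative only).
# lm_condition() stands in for the transformer backbone;
# denoise_step() stands in for one reverse-diffusion step
# of a lightweight diffusion head. Neither reflects VibeVoice's
# real components.

LATENT_DIM = 8  # hypothetical latent size

def lm_condition(history):
    # Map the latent history to a per-step conditioning vector.
    if not history:
        return [0.0] * LATENT_DIM
    n = len(history)
    return [sum(vec[i] for vec in history) / n for i in range(LATENT_DIM)]

def denoise_step(x, cond, t):
    # One reverse step: nudge x toward a (toy) prediction of the clean latent.
    predicted = [0.9 * xi + 0.1 * ci for xi, ci in zip(x, cond)]
    return [xi + (pi - xi) / t for xi, pi in zip(x, predicted)]

def generate(num_tokens, diffusion_steps=10, seed=0):
    rng = random.Random(seed)
    latents = []
    for _ in range(num_tokens):
        cond = lm_condition(latents)
        x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # start from noise
        for t in range(diffusion_steps, 0, -1):               # reverse process
            x = denoise_step(x, cond, t)
        latents.append(x)  # the next "token" is a continuous latent vector
    return latents

seq = generate(5)
print(len(seq), len(seq[0]))  # 5 8
```

The key contrast with discrete-token TTS is that each autoregressive step emits a continuous latent vector (later decoded to audio by the tokenizer's decoder) rather than an index into a codebook, which avoids quantization loss at the sequence-modeling stage.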