VoXtream2: Full-stream TTS with dynamic speaking rate control

📅 2026-03-13
🤖 AI Summary
This work proposes VoXtream2, a zero-shot, fully streaming text-to-speech (TTS) model that starts speaking with minimal delay and stays controllable while text arrives incrementally. It combines distribution matching over duration states with classifier-free guidance across conditioning signals, and uses prompt-text masking so that audio prompts need no transcription. Speaking rate can be adjusted mid-utterance. On standard zero-shot benchmarks and a dedicated speaking-rate test set, the model is competitive with public baselines on objective and subjective metrics, despite a smaller model and less training data. In full-stream mode it runs 4 times faster than real time with a first-packet latency of 74 ms on a consumer GPU.
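To make the speed figure concrete, the sketch below converts between a real-time factor (RTF) and a speedup. The function names and the timing values in the usage lines are hypothetical illustrations, not measurements from the paper; only the "4 times faster than real time" relationship comes from the text above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF: compute time divided by duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """Return how many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timing: 10 s of audio synthesized in 2.5 s of compute.
rtf = real_time_factor(2.5, 10.0)
print(rtf, speedup(rtf))
```

An RTF below 1 means audio is produced faster than it plays back, which is what a streaming system needs to avoid stalls once playback has started. First-packet latency is a separate quantity: it measures the delay before the first audio chunk is available.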

📝 Abstract
Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
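For readers unfamiliar with classifier-free guidance, the sketch below shows the standard single-condition form, which extrapolates from an unconditional prediction toward a conditional one. It is a generic illustration only: the abstract does not specify how VoXtream2 combines guidance across its conditioning signals, and the function name, the plain-list logits, and the scale values are assumptions made for this example.

```python
from typing import List


def cfg_combine(cond_logits: List[float],
                uncond_logits: List[float],
                scale: float) -> List[float]:
    """Generic classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 0 returns the unconditional logits, scale = 1 returns the
    conditional logits, and scale > 1 pushes further toward the condition.
    This is the textbook formula, not the paper's exact formulation.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Hypothetical logits over a two-token vocabulary.
guided = cfg_combine([1.0, 2.0], [0.0, 0.0], scale=2.0)
print(guided)
```

In practice the unconditional branch is usually obtained by dropping or masking the conditioning input during training, so one model can produce both predictions at inference time.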

Problem

Research questions and friction points this paper is trying to address.

full-stream TTS
dynamic speaking-rate control
interactive systems
zero-shot TTS
incremental text

Innovation

Methods, ideas, or system contributions that make the work stand out.

full-stream TTS
dynamic speaking-rate control
classifier-free guidance
textless audio prompting
zero-shot TTS