🤖 AI Summary
This work proposes a zero-shot, fully streaming text-to-speech (TTS) model that starts speaking with very low latency and remains controllable as text arrives incrementally. By aligning duration states through distribution matching, applying classifier-free guidance across conditioning signals, and incorporating a prompt-text masking mechanism, the model supports transcription-free audio prompting and real-time speaking-rate adjustment mid-utterance. Evaluated on standard zero-shot benchmarks and a dedicated speaking-rate test set, the method achieves speech quality comparable to existing baselines in both subjective and objective metrics. Notably, it delivers 4× real-time synthesis speed and a first-packet latency of only 74 ms on a consumer-grade GPU, demonstrating strong practicality for interactive applications requiring responsive and adaptive speech generation.
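Classifier-free guidance over several conditioning signals is commonly realized by blending the model's predictions made with and without each condition. The sketch below shows that general recipe for two hypothetical conditions (text and speaking rate) with illustrative guidance weights; the function name, per-condition weights, and interface are assumptions for illustration, not VoXtream2's actual formulation.

```python
import torch

def multi_condition_cfg(
    pred_uncond: torch.Tensor,  # model output with every condition dropped (null embeddings)
    pred_text: torch.Tensor,    # output with only the text condition active
    pred_rate: torch.Tensor,    # output with only the speaking-rate condition active
    w_text: float = 2.0,        # illustrative guidance weight, not the paper's value
    w_rate: float = 1.0,        # illustrative guidance weight, not the paper's value
) -> torch.Tensor:
    """Combine per-condition predictions with additive classifier-free guidance.

    Each weighted term pushes the unconditional prediction toward the
    corresponding condition; larger weights mean stronger adherence.
    """
    return (
        pred_uncond
        + w_text * (pred_text - pred_uncond)
        + w_rate * (pred_rate - pred_uncond)
    )
```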
📝 Abstract
Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be adjusted on the fly, mid-utterance. VoXtream2 combines a distribution-matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with a 74 ms first-packet latency on a consumer GPU.
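To make the closing performance claim concrete: real-time factor (RTF) is wall-clock synthesis time divided by the duration of the generated audio, and first-packet latency is the delay until the first audio chunk is emitted. Below is a minimal measurement sketch for a generic streaming generator; the chunk interface and names are assumptions for illustration, not VoXtream2's actual API.

```python
import time

def measure_streaming_tts(stream_chunks, sample_rate: int):
    """Measure first-packet latency and real-time factor of a streaming TTS run.

    `stream_chunks` is assumed to be an iterator that yields waveform chunks
    (sequences of samples) as they are synthesized.
    """
    start = time.perf_counter()
    first_packet_latency = None
    total_samples = 0
    for chunk in stream_chunks:
        if first_packet_latency is None:
            # Time from request to the first audible audio chunk.
            first_packet_latency = time.perf_counter() - start
        total_samples += len(chunk)
    wall_time = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    # RTF < 1 means faster than real time; 0.25 corresponds to 4x real time.
    rtf = wall_time / audio_seconds if audio_seconds else float("inf")
    return first_packet_latency, rtf
```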