🤖 AI Summary
Existing zero-shot streaming TTS systems predominantly rely on multi-stage discrete modeling, resulting in high latency, substantial computational overhead, and constrained speech quality. This paper proposes the first single-stage, zero-shot, low-latency streaming TTS framework, which performs interleaved autoregressive modeling over continuous mel-spectrograms: text tokens and acoustic frames are fed into the model alternately, combined with streaming attention and a zero-shot speaker adaptation mechanism to enable end-to-end online speech synthesis. Evaluated on LibriSpeech, the method significantly outperforms existing streaming baselines: speech naturalness and speaker similarity approach those of offline systems, while end-to-end latency is reduced by over 50%. To the authors' knowledge, this is the first work to achieve high-fidelity real-time synthesis without sacrificing strong cross-speaker generalization.
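The interleaving idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `frames_per_token` ratio, the token/frame representations, and the fallback handling of leftover frames are all hypothetical assumptions, since the summary only states that text tokens and continuous mel frames alternate in one autoregressive sequence.

```python
# Hypothetical sketch of interleaved input construction for a
# single-stage streaming TTS model: text tokens alternate with blocks
# of continuous mel frames so one autoregressive model consumes both.
from typing import List, Tuple

def interleave(text_tokens: List[int],
               mel_frames: List[List[float]],
               frames_per_token: int = 2) -> List[Tuple[str, object]]:
    """Merge a text stream and a mel-frame stream into one sequence.

    frames_per_token is an assumed fixed ratio; a real system would
    derive the schedule from alignment or generation order.
    """
    seq: List[Tuple[str, object]] = []
    f = 0
    for tok in text_tokens:
        seq.append(("text", tok))                 # one text token...
        for frame in mel_frames[f:f + frames_per_token]:
            seq.append(("mel", frame))            # ...then a frame block
        f += frames_per_token
    # Any remaining frames follow once the text is exhausted.
    for frame in mel_frames[f:]:
        seq.append(("mel", frame))
    return seq

# Example: 3 text tokens interleaved with 8 two-dimensional mel frames.
tokens = [101, 102, 103]
frames = [[0.1 * i, 0.2 * i] for i in range(8)]
seq = interleave(tokens, frames)
```

Because the model sees mel frames as soon as the first text tokens arrive, it can begin emitting audio before the full sentence is available, which is what enables the low-latency streaming behavior described above.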
📝 Abstract
Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models. Audio samples are available at: https://aka.ms/StreamMel.