StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling

📅 2025-06-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot streaming TTS systems predominantly rely on multi-stage discrete modeling, resulting in high latency, substantial computational overhead, and constrained speech quality. This paper proposes the first single-stage, zero-shot, low-latency streaming TTS framework, built on interleaved autoregressive modeling over continuous mel-spectrograms: text tokens and acoustic frames are fed into the model alternately, combined with streaming attention and a zero-shot speaker adaptation mechanism to enable end-to-end online speech synthesis. Evaluated on LibriSpeech, the method significantly outperforms existing streaming baselines; speech naturalness and speaker similarity approach those of offline systems, while end-to-end latency is reduced by over 50%. To the best of the authors' knowledge, this is the first work to achieve high-fidelity real-time synthesis without compromising cross-speaker generalization.
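The core idea summarized above is interleaving: text tokens and continuous mel-spectrogram frames are woven into a single autoregressive sequence so the model can begin emitting acoustic frames before the full text is consumed. A minimal sketch of such an interleaving schedule is shown below; the function name, the tagged-tuple representation, and the fixed text-to-frame `ratio` are illustrative assumptions, not the paper's actual implementation.

```python
def interleave(text_tokens, mel_frames, ratio=2):
    """Build one autoregressive sequence by alternating text tokens with
    mel-spectrogram frames: after each text token, emit `ratio` acoustic
    frames (a hypothetical fixed schedule for illustration)."""
    seq = []
    mel_iter = iter(mel_frames)
    for tok in text_tokens:
        seq.append(("text", tok))
        for _ in range(ratio):
            frame = next(mel_iter, None)
            if frame is not None:
                seq.append(("mel", frame))
    # Speech is usually longer than text: flush any remaining frames
    # after the text tokens are exhausted.
    for frame in mel_iter:
        seq.append(("mel", frame))
    return seq
```

In a streaming setting, a model trained on such sequences predicts the next mel frame conditioned on all earlier text and acoustic entries, so synthesis can start as soon as the first few text tokens arrive.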

📝 Abstract
Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models. Audio samples are available at: https://aka.ms/StreamMel.
Problem

Research questions and friction points this paper is trying to address.

Real-time zero-shot TTS synthesis for unseen speakers
Single-stage streaming TTS with continuous mel-spectrograms
Low-latency autoregressive synthesis with high speaker similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-stage streaming TTS framework
Interleaved text and acoustic modeling
Continuous mel-spectrogram autoregressive synthesis
Hui Wang
College of Computer Science, Nankai University, China
Yifan Yang
Microsoft Corporation
Shujie Liu
Microsoft Corporation
Jinyu Li
Partner Applied Science Manager, Microsoft
Acoustic Modeling, Speech Recognition, Speech Translation
Lingwei Meng
ByteDance; The Chinese University of Hong Kong
Speech and Language Processing, Speech Recognition, Speech Synthesis
Yanqing Liu
Microsoft Corporation
Text-to-Speech, Speech Recognition, Speech Editing, Overdubbing, NLP
Jiaming Zhou
College of Computer Science, Nankai University, China
Haoqin Sun
Nankai University
Affective Computing, Speech Signal Processing, Audio Understanding
Yan Lu
Microsoft Corporation
Yong Qin
College of Computer Science, Nankai University, China