🤖 AI Summary
Existing zero-shot streaming TTS systems predominantly rely on multi-stage discrete modeling, resulting in high latency, substantial computational overhead, and constrained speech quality. This paper proposes the first single-stage, zero-shot, low-latency streaming TTS framework, which performs interleaved autoregressive modeling over continuous mel-spectrograms: text tokens and acoustic frames are fed into the model alternately, combined with streaming attention and a zero-shot speaker adaptation mechanism to enable end-to-end online speech synthesis. Evaluated on LibriSpeech, the method significantly outperforms existing streaming baselines: speech naturalness and speaker similarity approach those of offline systems, while end-to-end latency is reduced by over 50%. To the authors' knowledge, this is the first work to achieve high-fidelity real-time synthesis without sacrificing strong cross-speaker generalization.
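The interleaving idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `frames_per_token` ratio, the token/frame representations, and the fallback handling of leftover frames are all hypothetical assumptions, since the summary only states that text tokens and continuous mel frames alternate in one autoregressive sequence.

```python
# Hypothetical sketch of interleaved input construction for a
# single-stage streaming TTS model: text tokens alternate with blocks
# of continuous mel frames so one autoregressive model consumes both.
from typing import List, Tuple

def interleave(text_tokens: List[int],
               mel_frames: List[List[float]],
               frames_per_token: int = 2) -> List[Tuple[str, object]]:
    """Merge a text stream and a mel-frame stream into one sequence.

    frames_per_token is an assumed fixed ratio; a real system would
    derive the schedule from alignment or generation order.
    """
    seq: List[Tuple[str, object]] = []
    f = 0
    for tok in text_tokens:
        seq.append(("text", tok))                 # one text token...
        for frame in mel_frames[f:f + frames_per_token]:
            seq.append(("mel", frame))            # ...then a frame block
        f += frames_per_token
    # Any remaining frames follow once the text is exhausted.
    for frame in mel_frames[f:]:
        seq.append(("mel", frame))
    return seq

# Example: 3 text tokens interleaved with 8 two-dimensional mel frames.
tokens = [101, 102, 103]
frames = [[0.1 * i, 0.2 * i] for i in range(8)]
seq = interleave(tokens, frames)
```

Because the model sees mel frames as soon as the first text tokens arrive, it can begin emitting audio before the full sentence is available, which is what enables the low-latency streaming behavior described above.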
📝 Abstract
Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models. Audio samples are available at: https://aka.ms/StreamMel.