Next Tokens Denoising for Speech Synthesis

📅 2025-07-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Autoregressive models in text-to-speech synthesis suffer from limited future-context modeling and slow inference, while diffusion models lack native support for key-value (KV) caching. To address these limitations, we propose Dragon-FM—the first framework to introduce continuous flow matching into discrete audio token prediction, unifying autoregressive and flow-matching paradigms. Methodologically, Dragon-FM employs intra-block parallel denoising to model future context and inter-block autoregressive modeling to enable cross-block KV caching—thereby balancing generation quality, inference speed, and long-range coherence. Integrated with scalar quantization and an efficient codec, it achieves 12.5 tokens/s inference on 48 kHz audio. Evaluated on a podcast dataset, Dragon-FM enables high-fidelity zero-shot speech synthesis, significantly improving both efficiency and naturalness for long-text generation. This work establishes a novel, efficient end-to-end paradigm for neural speech synthesis.

📝 Abstract
While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact rate of 12.5 tokens per second. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Consequently, the proposed model can utilize the KV-cache across chunks and incorporate future context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. This efficient codec and fast chunk-autoregressive architecture also make the proposed model particularly effective for generating extended content. Experiments on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.
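The abstract's claim that continuous flow-matching can predict discrete tokens with finite scalar quantizers can be illustrated with a minimal sketch. The function below is a hypothetical illustration of finite scalar quantization in general, not the paper's implementation: each latent dimension is bounded and rounded to one of a small fixed number of levels, and the per-dimension indices combine into a single discrete token id.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization (illustrative sketch).

    Bound each latent dimension with tanh, round it to one of
    `levels[d]` uniform values, and fold the per-dimension indices
    into a single integer token id.
    """
    z = np.asarray(z, dtype=float)
    L = np.asarray(levels, dtype=float)
    bounded = np.tanh(z)                      # map each dim into [-1, 1]
    idx = np.round((bounded + 1) / 2 * (L - 1))   # per-dim level index
    codes = idx / (L - 1) * 2 - 1             # quantized values in [-1, 1]
    token = 0                                  # mixed-radix index -> one id
    for d in range(len(levels)):
        token = token * int(levels[d]) + int(idx[d])
    return codes, token
```

With `levels = [3, 3]` there are 9 possible tokens; a continuous model can regress the bounded values while the discrete token id is recovered deterministically.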
Problem

Research questions and friction points this paper is trying to address.

AR models lack future context and generate slowly
Diffusion models struggle with KV caching
How to unify AR and flow-matching for efficient TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies AR and flow-matching for TTS
Processes audio tokens in chunks efficiently
Enables KV-cache and future context usage
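The three points above can be sketched as a toy generation loop: chunks are produced left-to-right (autoregressive, so earlier chunks can be cached), while the tokens inside each chunk are denoised in parallel by a few flow-matching (Euler) steps. Everything here is assumed for illustration only: `velocity_model`, the step count, and the plain list standing in for a transformer KV cache are hypothetical, not the paper's architecture.

```python
import numpy as np

def generate(velocity_model, num_chunks, chunk_len, dim, steps=8, rng=None):
    """Chunk-autoregressive flow-matching generation (illustrative sketch).

    `velocity_model(x, t, history)` stands in for a trained network that
    predicts the flow velocity for the current chunk, conditioned on all
    previously generated chunks (the stand-in for a KV cache).
    """
    rng = rng or np.random.default_rng(0)
    history = []                                   # cached chunks, never recomputed
    out = []
    for _ in range(num_chunks):
        x = rng.standard_normal((chunk_len, dim))  # start each chunk from noise
        for s in range(steps):                     # parallel intra-chunk denoising
            t = s / steps
            v = velocity_model(x, t, history)      # sees past chunks + whole current chunk
            x = x + v / steps                      # Euler step along the flow
        history.append(x)                          # extend the cache autoregressively
        out.append(x)
    return np.concatenate(out, axis=0)
```

Because the inner loop updates all `chunk_len` tokens at once, every token in a chunk conditions on its future neighbors within that chunk, while the outer loop keeps generation causal across chunks.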
Authors

Yanqing Liu, Microsoft Corporation
Ruiqing Xue, Microsoft
Chong Zhang, Microsoft
Yufei Liu, Microsoft
Gang Wang, Microsoft
Bohan Li, Microsoft
Yao Qian, Microsoft
Lei He, Microsoft
Shujie Liu, Microsoft
Sheng Zhao, Microsoft