DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation

📅 2025-10-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing continuous speech representations suffer from poor robustness under distribution shift and offer limited controllability. To address this, we propose DiSTAR, the first zero-shot text-to-speech (TTS) framework operating entirely in a discrete residual vector quantization (RVQ) code space. DiSTAR couples block-level autoregressive modeling with a parallel masked diffusion model, eliminating the need for explicit duration prediction or forced alignment. It enables controllable generation via classifier-free guidance and hierarchical RVQ code inference, supporting dynamic bitrate/computation pruning and multiple decoding strategies. Experiments demonstrate that DiSTAR outperforms state-of-the-art zero-shot TTS methods in naturalness, speaker consistency, synthesis robustness, and phonetic/expressive diversity.

πŸ“ Abstract
Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are available at https://anonymous.4open.science/w/DiSTAR_demo.
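The decoding loop the abstract describes (an AR model drafts a block of coarse RVQ codes, then masked-diffusion infilling fills the residual layers of that block in parallel) can be sketched in toy form. This is a minimal illustration, not the paper's implementation: the "models" below are random stand-ins, and all names (`ar_draft`, `diffusion_infill`, `MASK`, the codebook size) are assumptions made for the sketch.

```python
import random

MASK = -1  # sentinel for positions not yet unmasked

def ar_draft(history, block_size):
    """Stand-in for the AR language model: drafts coarse (layer-0)
    codes for the next block, conditioned on the token history.
    A real model would predict tokens; here we sample at random."""
    return [random.randrange(256) for _ in range(block_size)]

def diffusion_infill(draft, num_layers, num_steps=4):
    """Stand-in for masked-diffusion infilling: starts from fully
    masked residual layers and unmasks a fraction of positions per
    step, conditioned on the AR draft, until no MASK remains."""
    block = [[MASK] * len(draft) for _ in range(num_layers)]
    block[0] = list(draft)  # layer 0 comes from the AR draft
    positions = [(l, t) for l in range(1, num_layers)
                 for t in range(len(draft))]
    random.shuffle(positions)
    per_step = max(1, len(positions) // num_steps)
    for step in range(num_steps):
        for l, t in positions[step * per_step:(step + 1) * per_step]:
            block[l][t] = random.randrange(256)
    for l, t in positions[num_steps * per_step:]:  # any remainder
        block[l][t] = random.randrange(256)
    return block

def generate(num_blocks=3, block_size=8, num_layers=4):
    """Blockwise generation: AR draft, then parallel infilling."""
    history, audio_codes = [], []
    for _ in range(num_blocks):
        draft = ar_draft(history, block_size)
        audio_codes.append(diffusion_infill(draft, num_layers))
        history.extend(draft)
    return audio_codes

codes = generate()
```

The point of the structure is that only the layer-0 draft is sequential across blocks; every residual layer within a block is completed in parallel by the infilling step.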
Problem

Research questions and friction points this paper is trying to address.

Developing robust text-to-speech synthesis resistant to distribution shifts
Enhancing controllability without forced alignment or duration predictors
Achieving parallel long-form generation while mitigating exposure bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete RVQ code space AR-diffusion coupling
Block-level parallel masked-diffusion infilling
Classifier-free guidance with RVQ layer pruning
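The third point above, variable bitrate via RVQ layer pruning, follows from how residual vector quantization decodes: each layer adds a finer correction, so dropping the last layers at test time trades reconstruction detail for bitrate and compute. A toy sketch, assuming random codebooks and made-up dimensions (8 layers, 256 codes, 16-dim embeddings), none of which are the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, CODEBOOK, DIM = 8, 256, 16
# Random stand-in codebooks; a real RVQ codec learns these.
codebooks = rng.normal(size=(NUM_LAYERS, CODEBOOK, DIM))

def rvq_decode(codes, keep_layers=None):
    """Decode RVQ codes by summing residual codebook embeddings.
    Keeping fewer layers lowers bitrate and compute at the cost
    of fine reconstruction detail."""
    if keep_layers is None:
        keep_layers = len(codes)
    frames = len(codes[0])
    out = np.zeros((frames, DIM))
    for layer in range(keep_layers):
        out += codebooks[layer][codes[layer]]  # (frames, DIM) lookup
    return out

codes = rng.integers(0, CODEBOOK, size=(NUM_LAYERS, 40))
full = rvq_decode(codes)                  # all 8 layers
pruned = rvq_decode(codes, keep_layers=4) # half the layers
```

With 256-entry codebooks each layer costs 8 bits per frame, so pruning from 8 layers to 4 halves the bitrate (64 to 32 bits per frame) with no retraining, which is what makes the control purely a test-time knob.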
Authors

Yakun Song
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University
Xiaobin Zhuang
ByteDance
Jiawei Chen
ByteDance
Zhikang Niu
Shanghai Jiao Tong University
Guanrou Yang
Shanghai Jiao Tong University
Chenpeng Du
ByteDance
Zhuo Chen
ByteDance
Yuping Wang
ByteDance
Yuxuan Wang
ByteDance
Xie Chen
X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University