Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

📅 2025-07-16
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing autoregressive text-to-speech (TTS) methods rely on single-codebook representations, causing severe loss of fine-grained acoustic information such as prosodic nuance and speaker timbre, which is particularly damaging in singing and music synthesis. To address this, we propose QTTS, a novel framework built upon QDAC, a residual vector quantization-based audio codec that enables low-distortion, high-fidelity speech reconstruction. QDAC trains an ASR-based autoregressive network end-to-end with a GAN, and QTTS models the resulting discrete codes through a hierarchical parallel architecture coupled with a delayed multi-head prediction mechanism, capturing cross-codebook dependencies while accelerating inference. Experiments demonstrate that QTTS significantly outperforms state-of-the-art baselines in naturalness, expressiveness, and robustness under complex acoustic conditions (e.g., singing synthesis). Notably, QTTS is the first end-to-end autoregressive TTS system to jointly achieve high-fidelity timbre reproduction and precise prosody modeling.

📝 Abstract
Text-to-speech (TTS) synthesis has seen renewed progress under the discrete modeling paradigm. Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. Even with post-hoc refinement techniques such as flow matching, these methods fail to recover fine-grained details (e.g., prosodic nuances, speaker-specific timbres), especially in challenging scenarios like singing voice or music synthesis. We propose QTTS, a novel TTS framework built upon our new audio codec, QDAC. The core innovation of QDAC lies in its end-to-end training of an ASR-based auto-regressive network with a GAN, which achieves superior semantic feature disentanglement for scalable, near-lossless compression. QTTS models these discrete codes using two innovative strategies: the Hierarchical Parallel architecture, which uses a dual-AR structure to model inter-codebook dependencies for higher-quality synthesis, and the Delay Multihead approach, which employs parallelized prediction with a fixed delay to accelerate inference. Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baselines. This suggests that scaling up compression via multi-codebook modeling is a promising direction for high-fidelity, general-purpose speech and audio generation.
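The codec side of the paper builds on residual vector quantization (RVQ): each codebook quantizes the residual left by the previous one, so stacking codebooks tightens the reconstruction. A minimal sketch of that mechanism, not the QDAC implementation; the codebooks here are random placeholders for illustration:

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Quantize x with a stack of codebooks: each stage encodes
    the residual left over by the previous stage."""
    codes, residual = [], x.astype(np.float64)
    for cb in codebooks:  # cb has shape (num_entries, dim)
        dists = np.linalg.norm(cb - residual, axis=1)  # distance to each entry
        idx = int(np.argmin(dists))                    # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]                  # pass remainder onward
    # reconstruction is the sum of the selected codewords
    return codes, x - residual

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]  # 4 stages, 8-dim
x = rng.normal(size=8)
codes, x_hat = residual_vector_quantize(x, codebooks)
```

Each frame is thus represented by one index per codebook (here 4 indices instead of 1), which is exactly the multi-codebook token stream the QTTS generator must model.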
Problem

Research questions and friction points this paper is trying to address.

Address information loss in single-codebook TTS synthesis
Recover fine-grained details in challenging audio scenarios
Improve synthesis quality and preserve expressive content
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end training of an ASR-based auto-regressive network with a GAN
Hierarchical Parallel dual-AR structure for dependencies
Delay Multihead parallel prediction for speed
👥 Authors
Yichen Han, Xiaoyang Hao (Tencent), Keming Chen, Weibo Xiong, Jun He, Ruonan Zhang, Junjie Cao (School of Mathematical Sciences, Dalian University of Technology), Yue Liu, Bowen Li, Dongrui Zhang, Hui Xia, Huilei Fu, Kai Jia (MIT), Kaixuan Guo, Mingli Jin, Qingyun Meng, Ruidong Ma, Ruiqian Fang, Shaotong Guo, Xuhui Li, Yang Xiang, Ying Zhang, Yulong Liu, Yunfeng Li, Yuyi Zhang (South China University of Technology)