Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional autoregressive speech synthesis models suffer from unstable inter-frame attention, high latency, and degraded audio quality when modeling long sequences, hindering real-time deployment. To address these limitations, we propose Dynamic Chunked Autoregressive Synthesis (DCAR), a novel framework featuring three key innovations: (1) a dynamic chunk-to-frame attention mechanism that adaptively adjusts prediction span, alleviating strict frame-wise dependency; (2) a lightweight policy module that guides variable-length chunk generation; and (3) multi-token prediction during training to enhance modeling efficiency. Experiments demonstrate that DCAR achieves up to a 72.27% improvement in intelligibility and 2.61× inference speedup over state-of-the-art autoregressive baselines, while maintaining high audio fidelity. The framework exhibits strong robustness and real-time capability, making it particularly suitable for low-latency applications.

📝 Abstract
Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models that rely on the next-token prediction paradigm encounter significant challenges when handling long speech sequences: they struggle to construct stable frame-to-frame attention, which increases latency and degrades synthesis quality, limiting their feasibility for real-time applications. To address these limitations, we introduce DCAR, a dynamic chunk-wise autoregressive synthesis framework designed to improve both efficiency and intelligibility robustness in AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts via a lightweight module trained on-policy. By dynamically adjusting the token prediction span, DCAR significantly reduces sequence-length dependency while maintaining high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, achieving up to a 72.27% intelligibility improvement and a 2.61x inference speedup on the test set. We further present extensive analysis supporting DCAR as a versatile foundation for next-generation speech synthesis systems.
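The decoding scheme the abstract describes (a lightweight policy module choosing a variable chunk length, and the AR model emitting that many tokens per step instead of one) can be sketched as a toy loop. Note this is a minimal illustration of the control flow only: `toy_policy` and `toy_model` are hypothetical stand-ins, not the paper's trained networks.

```python
# Hypothetical sketch of DCAR-style dynamic chunk-wise decoding.
# All functions and values are toy stand-ins for illustration.

def toy_policy(context):
    """Stand-in for the lightweight policy module: pick the next
    chunk length (1..4 tokens) from the current context length."""
    return 1 + (len(context) % 4)

def toy_model(context, chunk_len):
    """Stand-in for multi-token prediction: emit `chunk_len` tokens
    at once instead of a single frame per step."""
    start = len(context)
    return [start + i for i in range(chunk_len)]

def dcar_decode(max_tokens=10):
    """Chunk-wise autoregressive loop: each step predicts a
    variable-length chunk, so decoding takes far fewer steps
    than the number of tokens produced."""
    context, steps = [], 0
    while len(context) < max_tokens:
        chunk_len = toy_policy(context)
        chunk = toy_model(context, chunk_len)
        context.extend(chunk[: max_tokens - len(context)])
        steps += 1
    return context, steps

tokens, steps = dcar_decode(10)
print(tokens, steps)  # 10 tokens produced in only 4 decoding steps
```

The speedup in the real system comes from the same effect shown here: the number of sequential AR steps shrinks from one per frame to one per (variable-length) chunk.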
Problem

Research questions and friction points this paper is trying to address.

Improve efficiency in autoregressive speech synthesis
Enhance intelligibility robustness in long sequences
Reduce latency for real-time speech applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic chunk-wise autoregressive synthesis framework
Chunk-to-frame attention mechanism
Dynamic token prediction span adjustment
Bohan Li
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Zhihan Li
Kuaishou Technology, Tsinghua University
Anomaly Detection, AIOps
Haoran Wang
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Hanglei Zhang
Shanghai Jiao Tong University
Yiwei Guo
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Hankun Wang
Shanghai Jiao Tong University
Speech Synthesis
Xie Chen
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Kai Yu
X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing