🤖 AI Summary
Traditional autoregressive speech synthesis models suffer from unstable inter-frame attention, high latency, and degraded audio quality when modeling long sequences, hindering real-time deployment. To address these limitations, we propose Dynamic Chunked Autoregressive Synthesis (DCAR), a novel framework featuring three key innovations: (1) a dynamic chunk-to-frame attention mechanism that adaptively adjusts the prediction span, relaxing the strict frame-by-frame dependency; (2) a lightweight, on-policy-trained module that selects variable-length chunks during generation; and (3) multi-token prediction during training to improve modeling efficiency. Experiments demonstrate that DCAR achieves up to a 72.27% improvement in intelligibility and a 2.61× inference speedup over conventional next-token prediction baselines while maintaining high audio fidelity. Its robustness and inference speed make it particularly suitable for low-latency, real-time applications.
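
To make the chunk-wise idea concrete, the sketch below shows one way such a decoding loop could be structured in PyTorch: a backbone summarizes the generated context, a small policy head picks the next chunk size, and multiple prediction heads emit that many tokens in a single step. This is a minimal illustration under stated assumptions; names such as `DynamicChunkDecoder`, `policy_head`, and `token_heads` are hypothetical and do not come from the paper's released code.

```python
import torch
import torch.nn as nn

class DynamicChunkDecoder(nn.Module):
    """Toy dynamic chunked AR decoder (illustrative, not the DCAR implementation):
    a shared backbone encodes the generated context, a policy head scores chunk
    sizes 1..K, and K parallel heads emit up to K tokens per step."""

    def __init__(self, vocab_size=1024, d_model=256, max_chunk=4):
        super().__init__()
        self.max_chunk = max_chunk
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer LM
        self.policy_head = nn.Linear(d_model, max_chunk)            # scores chunk sizes 1..max_chunk
        self.token_heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(max_chunk)]
        )

    @torch.no_grad()
    def generate(self, prompt_tokens, max_len=64):
        tokens = prompt_tokens.clone()                                # (1, T0)
        while tokens.size(1) < max_len:
            h, _ = self.backbone(self.embed(tokens))                  # (1, T, d_model)
            ctx = h[:, -1]                                            # context summary at the last step
            chunk = self.policy_head(ctx).argmax(-1).item() + 1       # dynamic chunk size in [1, K]
            # emit `chunk` tokens in one decoding step via the multi-token heads
            new = [self.token_heads[k](ctx).argmax(-1) for k in range(chunk)]
            tokens = torch.cat([tokens, torch.stack(new, dim=1)], dim=1)
        return tokens[:, :max_len]

# usage with a random prompt of 8 speech tokens
decoder = DynamicChunkDecoder()
out = decoder.generate(torch.randint(0, 1024, (1, 8)), max_len=32)
print(out.shape)  # torch.Size([1, 32])
```

Because several tokens are committed per step, the number of autoregressive iterations (and thus latency) scales with the number of chunks rather than the number of frames, which is the source of the reported speedup.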
📝 Abstract
Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models that rely on the next-token prediction paradigm encounter significant challenges when handling long speech sequences: they struggle to construct stable frame-to-frame attention, which increases latency and degrades synthesis quality, limiting their feasibility for real-time applications. To address these limitations, we introduce DCAR, a novel dynamic chunk-wise autoregressive synthesis framework designed to improve both the efficiency and the intelligibility robustness of AR speech generation. DCAR introduces a chunk-to-frame attention mechanism, trained with multi-token prediction, together with a lightweight on-policy module that enables dynamic chunk prediction across variable speech contexts. By dynamically adjusting the token prediction span, DCAR significantly reduces the dependency on sequence length while preserving high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, simultaneously achieving up to a 72.27% intelligibility improvement and a 2.61× inference speedup on the test set. We further conduct extensive analyses that support DCAR as a versatile foundation for next-generation speech synthesis systems.
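
As a rough illustration of the multi-token prediction objective described in the abstract, the hypothetical snippet below trains K output heads so that head k predicts the token k+1 steps ahead of each context position, letting one context supervise an entire chunk of future frames. The function and tensor names are placeholders; the actual DCAR training loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_token_loss(hidden, targets, token_heads):
    """Toy multi-token prediction objective (illustrative): from the hidden
    state at position t, head k is trained to predict the token at t+k+1."""
    T = hidden.size(1)
    losses = []
    for k, head in enumerate(token_heads):
        valid = T - (k + 1)                              # positions with a (k+1)-step-ahead target
        if valid <= 0:
            break
        logits = head(hidden[:, :valid])                 # (B, valid, vocab)
        labels = targets[:, k + 1 : k + 1 + valid]       # shifted ground-truth tokens
        losses.append(F.cross_entropy(logits.transpose(1, 2), labels))
    return torch.stack(losses).mean()

# minimal usage with random tensors standing in for LM states and speech tokens
B, T, d_model, vocab, K = 2, 16, 256, 1024, 4
heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(K)])
loss = multi_token_loss(torch.randn(B, T, d_model), torch.randint(0, vocab, (B, T)), heads)
loss.backward()
```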