🤖 AI Summary
Traditional autoregressive speech synthesis models suffer from unstable inter-frame attention, high latency, and degraded audio quality when modeling long sequences, hindering real-time deployment. To address these limitations, we propose Dynamic Chunked Autoregressive Synthesis (DCAR), a novel framework featuring three key innovations: (1) a dynamic chunk-to-frame attention mechanism that adaptively adjusts the prediction span, relaxing the strict frame-by-frame dependency; (2) a lightweight, on-policy-trained module that selects variable-length chunks during generation; and (3) multi-token prediction during training to improve modeling efficiency. Experiments demonstrate that DCAR achieves up to a 72.27% improvement in intelligibility and a 2.61× inference speedup over conventional next-token prediction baselines while maintaining high audio fidelity. Its robustness and inference speed make it particularly suitable for low-latency, real-time applications.
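
To make the chunk-wise idea concrete, the sketch below shows one way such a decoding loop could be structured in PyTorch: a backbone summarizes the generated context, a small policy head picks the next chunk size, and multiple prediction heads emit that many tokens in a single step. This is a minimal illustration under stated assumptions; names such as `DynamicChunkDecoder`, `policy_head`, and `token_heads` are hypothetical and do not come from the paper's released code.

```python
import torch
import torch.nn as nn

class DynamicChunkDecoder(nn.Module):
    """Toy dynamic chunked AR decoder (illustrative, not the DCAR implementation):
    a shared backbone encodes the generated context, a policy head scores chunk
    sizes 1..K, and K parallel heads emit up to K tokens per step."""

    def __init__(self, vocab_size=1024, d_model=256, max_chunk=4):
        super().__init__()
        self.max_chunk = max_chunk
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer LM
        self.policy_head = nn.Linear(d_model, max_chunk)            # scores chunk sizes 1..max_chunk
        self.token_heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(max_chunk)]
        )

    @torch.no_grad()
    def generate(self, prompt_tokens, max_len=64):
        tokens = prompt_tokens.clone()                                # (1, T0)
        while tokens.size(1) < max_len:
            h, _ = self.backbone(self.embed(tokens))                  # (1, T, d_model)
            ctx = h[:, -1]                                            # context summary at the last step
            chunk = self.policy_head(ctx).argmax(-1).item() + 1       # dynamic chunk size in [1, K]
            # emit `chunk` tokens in one decoding step via the multi-token heads
            new = [self.token_heads[k](ctx).argmax(-1) for k in range(chunk)]
            tokens = torch.cat([tokens, torch.stack(new, dim=1)], dim=1)
        return tokens[:, :max_len]

# usage with a random prompt of 8 speech tokens
decoder = DynamicChunkDecoder()
out = decoder.generate(torch.randint(0, 1024, (1, 8)), max_len=32)
print(out.shape)  # torch.Size([1, 32])
```

Because several tokens are committed per step, the number of autoregressive iterations (and thus latency) scales with the number of chunks rather than the number of frames, which is the source of the reported speedup.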
📝 Abstract
Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models that rely on the next-token prediction paradigm encounter significant challenges when handling long speech sequences: they struggle to construct stable frame-to-frame attention, which increases latency and degrades synthesis quality, limiting their feasibility for real-time applications. To address these limitations, we introduce DCAR, a novel dynamic chunk-wise autoregressive synthesis framework designed to improve both the efficiency and the intelligibility robustness of AR speech generation. DCAR introduces a chunk-to-frame attention mechanism, trained with multi-token prediction, together with a lightweight on-policy module that enables dynamic chunk prediction across variable speech contexts. By dynamically adjusting the token prediction span, DCAR significantly reduces the dependency on sequence length while preserving high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, simultaneously achieving up to a 72.27% intelligibility improvement and a 2.61× inference speedup on the test set. We further conduct extensive analyses that support DCAR as a versatile foundation for next-generation speech synthesis systems.
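
As a rough illustration of the multi-token prediction objective described in the abstract, the hypothetical snippet below trains K output heads so that head k predicts the token k+1 steps ahead of each context position, letting one context supervise an entire chunk of future frames. The function and tensor names are placeholders; the actual DCAR training loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_token_loss(hidden, targets, token_heads):
    """Toy multi-token prediction objective (illustrative): from the hidden
    state at position t, head k is trained to predict the token at t+k+1."""
    T = hidden.size(1)
    losses = []
    for k, head in enumerate(token_heads):
        valid = T - (k + 1)                              # positions with a (k+1)-step-ahead target
        if valid <= 0:
            break
        logits = head(hidden[:, :valid])                 # (B, valid, vocab)
        labels = targets[:, k + 1 : k + 1 + valid]       # shifted ground-truth tokens
        losses.append(F.cross_entropy(logits.transpose(1, 2), labels))
    return torch.stack(losses).mean()

# minimal usage with random tensors standing in for LM states and speech tokens
B, T, d_model, vocab, K = 2, 16, 256, 1024, 4
heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(K)])
loss = multi_token_loss(torch.randn(B, T, d_model), torch.randint(0, vocab, (B, T)), heads)
loss.backward()
```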