🤖 AI Summary
This work addresses the tension between the low training efficiency of autoregressive models and the slow inference speed of diffusion models by proposing the Causal Autoregressive Diffusion (CARD) framework. CARD reformulates the diffusion process under a strictly causal attention mask, enabling dense per-token supervision within a single forward pass and supporting dynamic parallel decoding. It introduces a novel soft-tail masking strategy to preserve local context and incorporates a signal-to-noise-ratio-based, context-aware reweighting mechanism to improve optimization stability and support variable-length generation. Experiments demonstrate that CARD retains the data efficiency of autoregressive models while reducing training latency by 3× relative to block diffusion methods, significantly outperforming existing discrete diffusion approaches and achieving both high-throughput inference and low-latency generation.
📝 Abstract
In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tail masking scheme to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
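The confidence-gated dynamic parallel decoding described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration — `dummy_model`, the threshold value, and the commit rule are hypothetical stand-ins, not CARD's actual denoiser or decoding policy: each step proposes all remaining tokens at once, then commits the longest prefix of proposals whose confidence clears a threshold, so the number of tokens emitted per step varies adaptively.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, BLOCK = 16, 8

def dummy_model(prefix, n_pred):
    """Stand-in for a causal denoiser (hypothetical interface):
    returns logits for n_pred future positions given the prefix.
    A real model would condition on the prefix via cached KV states."""
    return rng.normal(size=(n_pred, VOCAB))

def parallel_decode(prefix, n_pred, threshold=0.3, max_steps=50):
    """Confidence-gated parallel decoding sketch: each step proposes
    all remaining tokens in one forward pass, then commits the longest
    run of leading proposals whose softmax confidence >= threshold."""
    out = list(prefix)
    remaining = n_pred
    for _ in range(max_steps):
        if remaining == 0:
            break
        logits = dummy_model(out, remaining)
        # Softmax over the vocabulary for each proposed position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        conf = probs.max(axis=-1)
        tokens = probs.argmax(axis=-1)
        # Always commit at least one token so decoding makes progress;
        # extend the committed span while confidence stays high.
        n_commit = 1
        while n_commit < remaining and conf[n_commit] >= threshold:
            n_commit += 1
        out.extend(tokens[:n_commit].tolist())
        remaining -= n_commit
    return out

seq = parallel_decode(prefix=[1, 2], n_pred=BLOCK)
```

With a well-calibrated model, easy spans commit many tokens per step (high throughput) while uncertain spans fall back toward one token per step, which is the latency/quality trade-off the abstract attributes to dynamic parallel decoding.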