🤖 AI Summary
This work addresses the tension between the low training efficiency of autoregressive models and the slow inference speed of diffusion models by proposing the Causal Autoregressive Diffusion (CARD) framework. CARD reformulates the diffusion process under a strictly causal attention mask, enabling dense per-token supervision within a single forward pass and supporting dynamic parallel decoding. It introduces a novel soft-tail masking strategy to preserve local context and incorporates a signal-to-noise-ratio-based, context-aware reweighting mechanism to improve optimization stability and support variable-length generation. Experiments demonstrate that CARD retains the data efficiency of autoregressive models while reducing training latency by 3× relative to block diffusion methods, significantly outperforming existing discrete diffusion approaches and achieving both high-throughput inference and low-latency generation.
📝 Abstract
In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tail masking scheme to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
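The confidence-gated dynamic parallel decoding described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration — `dummy_model`, the threshold value, and the commit rule are hypothetical stand-ins, not CARD's actual denoiser or decoding policy: each step proposes all remaining tokens at once, then commits the longest prefix of proposals whose confidence clears a threshold, so the number of tokens emitted per step varies adaptively.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, BLOCK = 16, 8

def dummy_model(prefix, n_pred):
    """Stand-in for a causal denoiser (hypothetical interface):
    returns logits for n_pred future positions given the prefix.
    A real model would condition on the prefix via cached KV states."""
    return rng.normal(size=(n_pred, VOCAB))

def parallel_decode(prefix, n_pred, threshold=0.3, max_steps=50):
    """Confidence-gated parallel decoding sketch: each step proposes
    all remaining tokens in one forward pass, then commits the longest
    run of leading proposals whose softmax confidence >= threshold."""
    out = list(prefix)
    remaining = n_pred
    for _ in range(max_steps):
        if remaining == 0:
            break
        logits = dummy_model(out, remaining)
        # Softmax over the vocabulary for each proposed position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        conf = probs.max(axis=-1)
        tokens = probs.argmax(axis=-1)
        # Always commit at least one token so decoding makes progress;
        # extend the committed span while confidence stays high.
        n_commit = 1
        while n_commit < remaining and conf[n_commit] >= threshold:
            n_commit += 1
        out.extend(tokens[:n_commit].tolist())
        remaining -= n_commit
    return out

seq = parallel_decode(prefix=[1, 2], n_pred=BLOCK)
```

With a well-calibrated model, easy spans commit many tokens per step (high throughput) while uncertain spans fall back toward one token per step, which is the latency/quality trade-off the abstract attributes to dynamic parallel decoding.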