AI Summary
To address information loss from discrete tokenization and limited long-horizon generation capability in continuous visual modeling, this paper proposes ACDiT, the first block-level autoregressive-diffusion hybrid architecture. Methodologically, it introduces a block-wise autoregressive unit enabling flexible interpolation between token-level prediction and full-sequence denoising; it incorporates a Skip-Causal Attention Mask (SCAM) and block-conditional diffusion modeling, in which prior blocks condition the continuous generation of the current block; and it supports KV caching for accelerated inference, alternating between autoregressive decoding and diffusion denoising. Experimentally, ACDiT outperforms same-scale autoregressive baselines on image and video generation, and its pretrained models transfer zero-shot to visual understanding tasks. Crucially, the work empirically uncovers, for the first time, the fundamental trade-off between autoregressive precision and diffusion robustness in long-horizon generation.
Abstract
We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms for modeling continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement: training is as simple as applying a Skip-Causal Attention Mask (SCAM) to a standard diffusion transformer. During inference, the process alternates between diffusion denoising and autoregressive decoding, making full use of the KV-Cache. We show that ACDiT performs best among all autoregressive baselines of similar model scale on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, pretrained ACDiT can be transferred to visual understanding tasks despite being trained with the diffusion objective. An analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and unlocks new avenues for unified models.
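To make the masking idea concrete, here is a minimal sketch of one plausible Skip-Causal Attention Mask construction, based only on the abstract's description. The sequence layout (clean blocks first, then the corresponding noised blocks), the `block_size` parameter, and the exact attention pattern are assumptions for illustration, not the paper's definitive implementation: each clean block attends block-causally to clean blocks up to itself, while each noised block "skips" to the clean blocks before it and attends fully within itself.

```python
import numpy as np

def skip_causal_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Sketch of a Skip-Causal Attention Mask (SCAM).

    Assumed layout: positions [0, n) hold the clean (already generated)
    blocks, positions [n, 2n) hold the noised blocks being denoised,
    where n = num_blocks * block_size. True means attention is allowed.
    """
    n = num_blocks * block_size
    mask = np.zeros((2 * n, 2 * n), dtype=bool)
    for i in range(num_blocks):
        # Clean block i attends to clean blocks 0..i (block-causal).
        clean_rows = slice(i * block_size, (i + 1) * block_size)
        mask[clean_rows, : (i + 1) * block_size] = True
        # Noised block i attends to clean blocks 0..i-1 (the "skip" link) ...
        noised_rows = slice(n + i * block_size, n + (i + 1) * block_size)
        mask[noised_rows, : i * block_size] = True
        # ... and fully within itself, as in bidirectional diffusion denoising.
        mask[noised_rows, noised_rows] = True
    return mask
```

With `num_blocks=3, block_size=2`, the first noised block sees only itself (no clean context, matching unconditional generation of the first block), while later noised blocks condition on all preceding clean blocks, which is what enables KV caching of the clean prefix at inference time.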