AI Summary
To address information loss from discrete tokenization and limited long-horizon generation capability in continuous visual modeling, this paper proposes ACDiT, the first block-level autoregressive-diffusion hybrid architecture. Methodologically, it introduces a block-wise autoregressive unit enabling flexible interpolation between token-level prediction and full-sequence denoising; it incorporates a Skip-Causal Attention Mask (SCAM) and block-conditional diffusion modeling, in which prior blocks condition the continuous generation of the current block; and it supports KV caching for accelerated inference, alternating between autoregressive decoding and diffusion denoising. Experimentally, ACDiT outperforms same-scale autoregressive baselines on image and video generation, and its pretrained models transfer zero-shot to visual understanding tasks. Crucially, the work empirically uncovers, for the first time, the fundamental trade-off between autoregressive precision and diffusion robustness in long-horizon generation.
Abstract
We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms for modeling continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement: training is as simple as applying a Skip-Causal Attention Mask (SCAM) to a standard diffusion transformer. During inference, the process alternates between diffusion denoising and autoregressive decoding, making full use of the KV-Cache. We show that ACDiT performs best among all autoregressive baselines of similar model scale on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, pretrained ACDiT can be transferred to visual understanding tasks despite being trained with the diffusion objective. An analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and unlocks new avenues for unified models.
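To make the masking idea concrete, here is a minimal sketch of one plausible Skip-Causal Attention Mask construction, based only on the abstract's description. The sequence layout (clean blocks first, then the corresponding noised blocks), the `block_size` parameter, and the exact attention pattern are assumptions for illustration, not the paper's definitive implementation: each clean block attends block-causally to clean blocks up to itself, while each noised block "skips" to the clean blocks before it and attends fully within itself.

```python
import numpy as np

def skip_causal_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Sketch of a Skip-Causal Attention Mask (SCAM).

    Assumed layout: positions [0, n) hold the clean (already generated)
    blocks, positions [n, 2n) hold the noised blocks being denoised,
    where n = num_blocks * block_size. True means attention is allowed.
    """
    n = num_blocks * block_size
    mask = np.zeros((2 * n, 2 * n), dtype=bool)
    for i in range(num_blocks):
        # Clean block i attends to clean blocks 0..i (block-causal).
        clean_rows = slice(i * block_size, (i + 1) * block_size)
        mask[clean_rows, : (i + 1) * block_size] = True
        # Noised block i attends to clean blocks 0..i-1 (the "skip" link) ...
        noised_rows = slice(n + i * block_size, n + (i + 1) * block_size)
        mask[noised_rows, : i * block_size] = True
        # ... and fully within itself, as in bidirectional diffusion denoising.
        mask[noised_rows, noised_rows] = True
    return mask
```

With `num_blocks=3, block_size=2`, the first noised block sees only itself (no clean context, matching unconditional generation of the first block), while later noised blocks condition on all preceding clean blocks, which is what enables KV caching of the clean prefix at inference time.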