🤖 AI Summary
This work proposes Dynamic Chunking Diffusion Transformer (DC-DiT), a novel architecture that overcomes the limitation of fixed image patching in conventional diffusion Transformers by enabling adaptive token compression conditioned on both visual content and denoising timestep. DC-DiT employs an end-to-end learnable encoder-router-decoder framework to perform content- and time-aware token reduction, achieving semantic-aware image chunking without explicit supervision and dynamically adjusting token count throughout the diffusion process. Built on the DiT backbone, DC-DiT supports efficient transfer from pretrained checkpoints and dynamic computational scaling. Experiments on ImageNet 256×256 demonstrate consistent improvements in FID and Inception Score over parameter- and FLOP-matched DiT baselines under 4× and 16× compression ratios, with training steps reduced by up to 8×.
📝 Abstract
Diffusion Transformers process images as fixed-length sequences of tokens produced by a static $\textit{patchify}$ operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner, using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. It also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet $256{\times}256$, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across $4{\times}$ and $16{\times}$ compression, suggesting the technique may extend to pixel-space, video, and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to $8{\times}$ fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.
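To make the encoder-router-decoder idea concrete, here is a minimal NumPy sketch of timestep-conditioned token compression. Everything in it is illustrative, not the paper's actual method: the router is a fixed per-patch variance heuristic standing in for a learned network, the keep-ratio schedule over the timestep `t` is invented, and the names `patchify` and `route` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patch tokens."""
    H, W, C = img.shape
    h, w = H // p, W // p
    patches = img.reshape(h, p, w, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(h * w, p * p * C)  # (num_tokens, token_dim)

def route(tokens, t, max_keep):
    """Toy router: keep more tokens at later (less noisy) timesteps.

    t in [0, 1]: t=0 is fully noisy (coarse stage), t=1 is clean.
    Per-token variance stands in for a learned router score, so
    near-uniform (background-like) patches are dropped first.
    The 0.25..1.0 keep-ratio schedule is an assumption for illustration.
    """
    keep = max(1, int(round(max_keep * (0.25 + 0.75 * t))))
    scores = tokens.var(axis=1)
    idx = np.argsort(scores)[-keep:]  # indices of the kept tokens
    return tokens[idx], idx           # decoder would scatter these back

img = rng.standard_normal((32, 32, 3))
tokens = patchify(img, p=4)                   # 64 tokens of dim 48
early, _ = route(tokens, t=0.1, max_keep=64)  # noisy stage: few tokens
late, _ = route(tokens, t=0.9, max_keep=64)   # detail stage: many tokens
print(early.shape[0], late.shape[0])          # prints 21 59
```

In DC-DiT the router and the compression schedule are learned jointly with the diffusion objective rather than hand-coded, and a decoder maps the compressed sequence back to the full token grid; this sketch only shows the shape of the computation.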