🤖 AI Summary
This work addresses the strong coupling between computational cost and image resolution in Diffusion Transformers (DiTs), which also allocate computation uniformly across spatial tokens and thus struggle to balance latency and generation quality. To overcome this limitation, the authors propose the Elastic Latent Interface Transformer (ELIT), which decouples resolution from computation by introducing a variable-length latent token sequence and uses lightweight Read and Write cross-attention layers to prioritize important input regions. ELIT enables a single DiT model to support multiple computational budgets: during training, it learns an importance ordering over latents by randomly dropping tail latents, and at inference, the number of latents is adjusted to match the compute budget without modifying the backbone architecture. Evaluated on ImageNet-1K at 512px, ELIT improves FID and FDD by 35.3% and 39.6% on average, respectively, and consistently improves performance across architectures including DiT, U-ViT, HDiT, and MM-DiT.
📝 Abstract
Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting compute on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations, with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K at 512px, ELIT delivers average gains of $35.3\%$ and $39.6\%$ in FID and FDD scores, respectively. Project page: https://snap-research.github.io/elit/
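The Read, process, Write flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy, pure-Python version under several assumptions that the abstract does not state. The attention is single-head with no learned projections, the residual additions are assumed, the number of kept latents is sampled uniformly, and the names `cross_attention`, `sample_num_latents`, `elastic_interface_step`, and `identity_blocks` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned Q/K/V projections, so keys and values are the
    same vectors. A real layer would project each input first.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
            for kv in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(dim)]
        )
    return outputs


def sample_num_latents(min_latents, max_latents, rng):
    """Training-time tail dropping: choose how long a prefix of latents to keep.

    Assumption: uniform sampling; the source only says the dropping is random.
    """
    return rng.randint(min_latents, max_latents)


def identity_blocks(latents):
    """Stand-in for the unchanged DiT blocks, which would run on the latents."""
    return latents


def elastic_interface_step(spatial, latents, num_keep, blocks=identity_blocks):
    """Read into a prefix of latents, process them, then Write back."""
    if num_keep < 1:
        raise ValueError("num_keep must be at least 1")
    kept = latents[:num_keep]  # drop the tail, keep the earliest latents
    read = cross_attention(kept, spatial)  # Read: latents query spatial tokens
    kept = [[z + r for z, r in zip(zs, rs)] for zs, rs in zip(kept, read)]
    kept = blocks(kept)  # cost depends on num_keep, not on len(spatial)
    written = cross_attention(spatial, kept)  # Write: spatial tokens query latents
    return [[s + w for s, w in zip(ss, ws)] for ss, ws in zip(spatial, written)]


rng = random.Random(0)
spatial = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 spatial tokens
latents = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]   # 6 latent tokens
num_keep = sample_num_latents(2, 6, rng)
updated = elastic_interface_step(spatial, latents, num_keep)
print(num_keep, len(updated), len(updated[0]))
```

In this sketch the spatial token count stays fixed at 16 whatever `num_keep` is, while the work done inside `blocks` scales only with the number of kept latents. That is the sense in which a latent interface can decouple compute from input size. At inference, `num_keep` would be set directly from the compute budget rather than sampled.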