One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the strong coupling between computational cost and image resolution in Diffusion Transformers (DiTs), which uniformly allocate computation and thus struggle to balance latency and generation quality. To overcome this limitation, the authors propose the Elastic Latent Interface (ELIT), which decouples resolution from computation by introducing variable-length latent token sequences and employs a lightweight read-write cross-attention mechanism to dynamically focus on salient regions. ELIT enables a single DiT model to support multiple computational budgets: during training, it learns token importance by randomly dropping trailing tokens, and at inference, it dynamically adjusts computational cost without modifying the backbone architecture. Evaluated on ImageNet-1K at 512px, ELIT improves FID and FDD by 35.3% and 39.6% on average, respectively, and consistently enhances performance across diverse architectures including DiT, U-ViT, HDiT, and MM-DiT.
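The read-write cross-attention mechanism described above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation; all names, sizes, and the single-head unprojected attention are assumptions made for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Single-head cross-attention without learned projections,
    # for illustration only.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
d = 16                                 # token dimension (assumed)
spatial = rng.normal(size=(256, d))    # e.g. 16x16 patch tokens
latents = rng.normal(size=(64, d))     # learnable latent sequence, length << 256

# Read: latents attend to spatial tokens, pulling in image content.
z = cross_attention(latents, spatial, d)
# ... the standard DiT blocks would operate on z here, at cost set by len(z) ...
# Write: spatial tokens attend to the processed latents to receive the result.
out = cross_attention(spatial, z, d)
```

The key point is that the DiT backbone only ever sees the latent sequence `z`, so its FLOPs scale with the number of latents rather than with the number of spatial tokens.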

📝 Abstract
Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting computation on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations: earlier latents capture global structure, while later ones refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains; on ImageNet-1K at 512px, it improves FID and FDD by an average of $35.3\%$ and $39.6\%$, respectively. Project page: https://snap-research.github.io/elit/
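The random tail-dropping during training and budget-matched truncation at inference can be illustrated as follows. This is a hedged sketch under assumed sizes (64 latents, a minimum prefix of 8); the actual schedule and latent counts are not specified here:

```python
import numpy as np

rng = np.random.default_rng(1)
n_latents = 64                               # full latent sequence length (assumed)
latents = rng.normal(size=(n_latents, 16))   # stand-in for the learned latents

# Training: keep a random-length prefix and drop the tail, so earlier
# latents are forced to carry global structure (importance ordering).
keep = int(rng.integers(low=8, high=n_latents + 1))
train_latents = latents[:keep]

# Inference: truncate deterministically to whatever the compute budget
# allows; fewer latents means fewer tokens through the DiT blocks.
budget_k = 32
infer_latents = latents[:budget_k]
```

Because dropping is always applied to the tail, a single trained model remains valid at every prefix length, which is what lets one checkpoint serve many budgets.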
Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers
compute budget
latency-quality trade-off
resource allocation
FLOPs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Elastic Latent Interface
Diffusion Transformers
Dynamic Compute Allocation
Importance-Ordered Latents
Cross-Attention Read/Write