ELT: Elastic Looped Transformers for Visual Generation

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the parameter inefficiency and high computational cost of visual generative models by proposing an elastic looped Transformer architecture. The design replaces a deep stack of unique layers with weight-shared transformer blocks applied iteratively, and introduces Intra-Loop Self Distillation (ILSD), in which intermediate-loop (student) configurations are distilled from the maximum-loop (teacher) configuration within a single training step. One training run therefore yields a family of elastic models that share the same parameters, and inference can run fewer loops to trade generation quality for lower compute. With a 4× reduction in parameter count under iso-inference-compute settings, the method achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101.

📝 Abstract
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference with dynamic trade-offs between computational cost and generation quality at a fixed parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With a $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and an FVD of $72.8$ on class-conditional UCF-101.
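
The looping and distillation ideas in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the block here is a residual `tanh` layer standing in for a transformer block, and the names `make_block`, `apply_block`, `looped_forward`, `ilsd_loss`, the MSE losses, and the `distill_weight` parameter are all assumptions made for illustration. The abstract does not specify the actual loss or architecture details.

```python
import math
import random


def make_block(dim, seed=0):
    """Create ONE set of weights (a dim x dim matrix) for the shared block."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, x):
    """Toy stand-in for a transformer block: residual update x + tanh(W x)."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for xi, row in zip(x, weights)]


def looped_forward(weights, x, num_loops):
    """Apply the same weights num_loops times; return the output of every loop."""
    outputs = []
    for _ in range(num_loops):
        x = apply_block(weights, x)
        outputs.append(x)
    return outputs


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def ilsd_loss(outputs, target, student_loops, distill_weight=1.0):
    """Assumed form of a self-distillation objective (hypothetical).

    Teacher: output after the maximum number of loops.
    Students: outputs after the earlier loop counts in student_loops (1-based).
    In an autograd framework the teacher would typically be detached so that
    gradients from the distillation term flow only through the students.
    """
    teacher = outputs[-1]
    task = mse(teacher, target)
    distill = sum(mse(outputs[k - 1], teacher) for k in student_loops)
    return task + distill_weight * distill / len(student_loops)


weights = make_block(dim=4)
outputs = looped_forward(weights, [0.1, -0.2, 0.3, 0.0], num_loops=8)
loss = ilsd_loss(outputs, target=[0.0] * 4, student_loops=[2, 4])
```

Because every loop reuses `weights`, the parameter count is independent of `num_loops`, and any prefix of `outputs` corresponds to a cheaper model obtained from the same training run.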
Problem

Research questions and friction points this paper is trying to address.

visual generation
parameter efficiency
computational cost
generation quality
model efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Elastic Looped Transformers
parameter efficiency
Intra-Loop Self Distillation
Any-Time inference
weight-shared recurrent architecture
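
The parameter-efficiency and Any-Time inference tags above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer counts (24 stacked versus 6 looped 4 times), the width of 1024, the rough per-block weight formula, and the helper names `block_params`, `model_stats`, and `loops_for_budget` are assumptions chosen to reproduce a 4× parameter reduction at equal compute, not configurations taken from the paper.

```python
def block_params(d_model, ffn_mult=4):
    """Rough weight count of one transformer block.

    4*d^2 for the attention projections (Q, K, V, output) plus
    2*ffn_mult*d^2 for the MLP. Biases and normalization layers are ignored.
    """
    return 4 * d_model ** 2 + 2 * ffn_mult * d_model ** 2


def model_stats(d_model, unique_blocks, loops):
    """Return (parameter count, block applications per forward pass)."""
    params = unique_blocks * block_params(d_model)
    block_applications = unique_blocks * loops  # proxy for inference compute
    return params, block_applications


def loops_for_budget(unique_blocks, max_loops, budget_applications):
    """Largest loop count that fits a compute budget, clamped to [1, max_loops]."""
    return max(1, min(max_loops, budget_applications // unique_blocks))


# Hypothetical iso-compute comparison: 24 unique layers vs. 6 layers looped 4 times.
stacked_params, stacked_compute = model_stats(1024, unique_blocks=24, loops=1)
looped_params, looped_compute = model_stats(1024, unique_blocks=6, loops=4)
print(stacked_compute == looped_compute)   # same number of block applications
print(stacked_params // looped_params)     # parameter reduction factor

# Any-time inference: run fewer loops when the compute budget is tight.
print(loops_for_budget(unique_blocks=6, max_loops=4, budget_applications=12))
```

Under these assumptions the looped model matches the stacked model's block applications with one quarter of the weights, and the loop count becomes a runtime knob rather than a property fixed at training time.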