🤖 AI Summary
This work addresses the inefficient parameter usage and high computational cost of visual generative models by proposing an elastic looped Transformer architecture. The design applies weight-shared transformer blocks iteratively to reduce parameter count and introduces an intra-loop self-distillation (ILSD) mechanism, in which intermediate-loop (student) configurations are distilled from the maximum-loop (teacher) configuration, so that multiple elastic model variants are optimized jointly within a single training run. The resulting models support inference at any intermediate loop count, allowing dynamic trade-offs between computational cost and generation quality. With a 4× reduction in parameter count under iso-inference-compute settings, the method achieves an FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101.
📝 Abstract
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach applies weight-shared transformer blocks iteratively, drastically reducing parameter count while maintaining high synthesis quality. To train these models effectively for image and video generation, we propose Intra-Loop Self Distillation (ILSD), in which student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) within a single training step, ensuring consistency across the model's depth. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference with dynamic trade-offs between computational cost and generation quality at a fixed parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With a $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and an FVD of $72.8$ on class-conditional UCF-101.
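The looping and distillation ideas described above can be illustrated with a toy sketch. This is not the authors' implementation: the residual `tanh` block stands in for a transformer block, and the names `make_block`, `looped_forward`, `ilsd_style_loss`, the mean-squared-error losses, and the `distill_weight` parameter are all assumptions made for illustration. The abstract does not specify the loss form.

```python
import math
import random

def make_block(dim, seed=0):
    """Weights of one toy block (hypothetical stand-in for a transformer block)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

def apply_block(weights, x):
    """Residual update: x + tanh(W x)."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for row, xi in zip(weights, x)]

def looped_forward(weights, x, num_loops):
    """Apply the SAME block num_loops times; return the output after each loop."""
    outputs = []
    for _ in range(num_loops):
        x = apply_block(weights, x)
        outputs.append(x)
    return outputs

def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def ilsd_style_loss(outputs, target, student_loops, distill_weight=1.0):
    """Task loss on the full-depth (teacher) output plus a distillation term
    pulling each intermediate (student) output toward the teacher output.
    Assumption: in an autodiff framework the teacher output would be treated
    as a constant (stop-gradient) inside the distillation term."""
    teacher = outputs[-1]
    task = mse(teacher, target)
    distill = sum(mse(outputs[k - 1], teacher) for k in student_loops)
    distill /= len(student_loops)
    return task + distill_weight * distill

def param_count(weights):
    return sum(len(row) for row in weights)

dim, max_loops = 8, 4
block = make_block(dim)
x = [0.1 * i for i in range(dim)]
target = [0.0] * dim

# One forward pass yields every intermediate output, so teacher and
# students are evaluated together in a single step.
outputs = looped_forward(block, x, max_loops)
loss = ilsd_style_loss(outputs, target, student_loops=[1, 2, 3])

# A stack of max_loops unique blocks would need max_loops times the weights.
looped_params = param_count(block)
stacked_params = max_loops * looped_params
```

Stopping after fewer loops reuses the same weights, which is what makes the compute and quality trade-off possible without changing the parameter count: in this sketch, `looped_forward(block, x, 2)[-1]` is identical to `outputs[1]`.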