🤖 AI Summary
Diffusion Transformers (DiTs) suffer from substantial computational redundancy during inference, resulting in high FLOPs overhead. To address this, we propose the first dynamic computation allocation paradigm for DiTs that requires no retraining and preserves the original architecture: it enables multi-level, on-demand compute scheduling on a single pre-trained DiT via computation-aware dynamic layer skipping and adaptive token compression. Our method is compatible with diverse DiT backbones and cross-modal tasks—including image/video generation and class-/text-/video-conditioned synthesis—while maintaining lossless generation quality. Experiments demonstrate up to 40% FLOPs reduction in image generation and as much as 75% compute savings in video generation, with consistent efficacy across multi-task settings. The core contribution is the first realization of decoupled compute–quality trade-offs in DiT inference: enabling universal, plug-and-play efficient inference without performance degradation.
📝 Abstract
Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework converts pre-trained DiT models into *flexible* ones -- dubbed FlexiDiT -- allowing them to process inputs at varying compute budgets. We demonstrate how a single *flexible* model can generate images without any drop in quality while reducing the required FLOPs by more than 40% compared to its static counterpart, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach extends readily to video generation, where FlexiDiT models generate samples with up to 75% less compute without compromising performance.
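To make the dynamic-allocation idea concrete, here is a minimal, hypothetical sketch of per-step compute scheduling: each denoising step gets a budget fraction, and the budget determines how many transformer blocks are executed. All names (`step_budgets`, `layers_to_run`, the half-compute schedule, uniform block skipping) are illustrative assumptions, not the paper's actual computation-aware policy.

```python
# Hypothetical sketch of dynamic per-step compute allocation for a DiT.
# The schedule and skipping rule below are toy choices for illustration,
# not the method described in the paper.

def step_budgets(num_steps: int, full_fraction: float) -> list[float]:
    """Assign a compute budget in (0, 1] to each denoising step.

    Early (high-noise) steps often tolerate cheaper forward passes, so
    this toy schedule runs only the last `full_fraction` of steps at
    full compute and the rest at half compute.
    """
    cutoff = int(num_steps * (1.0 - full_fraction))
    return [0.5 if t < cutoff else 1.0 for t in range(num_steps)]

def layers_to_run(num_layers: int, budget: float) -> list[int]:
    """Pick which transformer blocks to execute under the budget,
    skipping uniformly (a real policy would be computation-aware)."""
    keep = max(1, round(num_layers * budget))
    stride = num_layers / keep
    return sorted({min(num_layers - 1, int(i * stride)) for i in range(keep)})

if __name__ == "__main__":
    budgets = step_budgets(num_steps=50, full_fraction=0.4)
    print(sum(budgets) / len(budgets))               # average compute fraction
    print(layers_to_run(num_layers=28, budget=0.5))  # blocks kept at half budget
```

In this toy schedule, 30 of 50 steps run at half compute, giving an average budget of 0.7, i.e. a 30% FLOPs reduction before any token compression; the paper's reported savings come from its learned, computation-aware allocation rather than this fixed heuristic.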