🤖 AI Summary
To address the spatiotemporal computational redundancy inherent in the static inference of Diffusion Transformer (DiT)-based models, this paper proposes a dual-dimensional dynamic generation architecture comprising Timestep-wise Dynamic Width (TDW) adjustment and Spatial-wise Dynamic Token (SDT) pruning. It is the first work to deeply integrate dynamic computation with flow-matching modeling, and it introduces timestep-based dynamic LoRA (TD-LoRA), enabling adaptive parameter allocation across timesteps. The framework supports mainstream backbones (including DiT, SiT, Latte, and FLUX) without architectural modification. Extensive experiments on image generation, video synthesis, and text-to-image tasks demonstrate up to a 2.1× inference speedup while maintaining or improving generation quality (e.g., FID, LPIPS). Furthermore, the fine-tuning parameter count is reduced by over 90%, significantly improving training efficiency and deployment flexibility.
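To make the two dynamic dimensions concrete, here is a minimal sketch of the idea behind TDW and SDT. The linear width schedule and the precomputed token scores are illustrative assumptions for this sketch; in the paper both decisions are made by learned, timestep- and input-conditioned components rather than the fixed heuristics shown here.

```python
import numpy as np

def tdw_width(t, T, d_full=1024, min_ratio=0.25):
    """Timestep-wise Dynamic Width (TDW) sketch: use full width at the
    final denoising steps (t near 0) and shrink it at noisier timesteps.
    The linear schedule is an illustrative stand-in for a learned router."""
    ratio = min_ratio + (1.0 - min_ratio) * (1.0 - t / T)
    return max(1, int(d_full * ratio))

def sdt_prune(tokens, scores, keep_ratio=0.5):
    """Spatial-wise Dynamic Token (SDT) sketch: process only the
    highest-scoring spatial tokens. `scores` stands in for a learned
    per-token importance predictor."""
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])  # keep spatial order
    return tokens[keep_idx], keep_idx

tokens = np.random.randn(16, 8)   # 16 spatial tokens, hidden dim 8
scores = np.random.rand(16)
kept, idx = sdt_prune(tokens, scores, keep_ratio=0.25)
print(tdw_width(t=900, T=1000), kept.shape)
```

Compute savings come from both axes at once: fewer channels per block at noisy timesteps, and fewer tokens attended to in unimportant spatial regions.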
📝 Abstract
Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the *static* inference paradigm, which inevitably introduces redundant computation at certain *diffusion timesteps* and *spatial regions*. To overcome this inefficiency, we propose **Dy**namic **Di**ffusion **T**ransformer (DyDiT), an architecture that *dynamically* adjusts its computation along both the *timestep* and *spatial* dimensions. Specifically, we introduce a *Timestep-wise Dynamic Width* (TDW) approach that adapts model width conditioned on the generation timestep. In addition, we design a *Spatial-wise Dynamic Token* (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we further enhance DyDiT in three key aspects. First, DyDiT is integrated seamlessly with flow matching-based generation, enhancing its versatility. Second, we extend DyDiT to more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT.
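The TD-LoRA idea, adapting the low-rank update to the timestep, can be sketched as follows. The bank of factors indexed by timestep bucket is a hypothetical parameterization chosen for illustration; the paper's actual conditioning mechanism may differ.

```python
import numpy as np

def td_lora_delta(t, T, d=64, r=4, n_banks=4, seed=0):
    """Hypothetical sketch of timestep-based dynamic LoRA (TD-LoRA):
    rather than one low-rank update shared by all timesteps, keep a small
    bank of rank-r factor pairs and select one per timestep bucket, so the
    effective weight update varies over the diffusion trajectory."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_banks, r, d)) * 0.01  # down-projections
    B = rng.standard_normal((n_banks, d, r)) * 0.01  # up-projections
    k = min(n_banks - 1, int(t / T * n_banks))       # timestep bucket
    return B[k] @ A[k]  # rank-r update applied as W + delta

delta = td_lora_delta(t=500, T=1000)
print(delta.shape)
```

Only the low-rank factors (2 * n_banks * r * d parameters here) would be trained while the full d x d weight stays frozen, which is the mechanism behind the large reduction in fine-tuning parameters the summary reports.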