🤖 AI Summary
Video diffusion Transformers (DiTs) achieve high generation quality but suffer from low inference efficiency due to the quadratic complexity of self-attention and the multi-step diffusion process. To address this, we propose a lightweight, training-free inference acceleration framework. First, we design a dynamic block-wise attention mechanism guided by a 3D Z-order space-filling curve to enable spatially aware attention pruning. Second, we introduce a progressive latent resolution upsampling strategy: global structure is modeled rapidly at low resolution, while local details are refined at high resolution, aligning naturally with sparse attention patterns. Evaluated on VBench, our method achieves an 8.83× speedup with only a 0.01% performance drop. Inference time falls from minutes to seconds, enabling plug-and-play deployment without architectural changes or retraining.
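To make the Z-order idea concrete, here is a minimal, illustrative sketch of 3D Morton (Z-order) encoding: interleaving the bits of (t, h, w) coordinates linearizes latent tokens so that contiguous blocks of the sequence cover compact 3D regions, which is what makes block-wise attention spatially coherent. The function names `interleave_bits` and `morton3d` are our own for illustration and are not from the Jenga codebase.

```python
def interleave_bits(x: int, bits: int) -> int:
    """Spread the low `bits` bits of x so consecutive bits land 3 positions apart."""
    out = 0
    for i in range(bits):
        out |= ((x >> i) & 1) << (3 * i)
    return out

def morton3d(t: int, h: int, w: int, bits: int = 4) -> int:
    """Z-order (Morton) index for a 3D latent-token coordinate (t, h, w)."""
    return (interleave_bits(t, bits) << 2) | (interleave_bits(h, bits) << 1) | interleave_bits(w, bits)

# Sorting token coordinates by their Morton code reorders the flat token
# sequence so that each contiguous chunk maps to a compact 3D sub-volume.
coords = [(t, h, w) for t in range(4) for h in range(4) for w in range(4)]
order = sorted(coords, key=lambda c: morton3d(*c))
```

For example, the first 8 tokens in `order` form exactly the 2×2×2 corner cube of the latent volume, so a block of 8 consecutive tokens attends over a spatially local neighborhood rather than a scattered set of positions.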
📝 Abstract
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
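The progressive-resolution insight can be sketched as a two-stage schedule: run early denoising steps on a low-resolution latent, then upsample it and let later steps refine high-frequency detail. The helper below is a hypothetical NumPy illustration using nearest-neighbor upsampling on a (C, T, H, W) latent; Jenga's actual interpolation and stage schedule may differ.

```python
import numpy as np

def upsample_latent(z: np.ndarray, scale: int = 2) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, T, H, W) latent along H and W.

    Illustrative only: the real pipeline may use a different interpolation
    scheme or also change the temporal resolution.
    """
    return z.repeat(scale, axis=2).repeat(scale, axis=3)

# Hypothetical two-stage schedule: early steps denoise the low-res latent
# (global structure), then the latent is upsampled and later steps refine
# local detail at full resolution with sparse attention.
z_low = np.random.randn(4, 8, 16, 16).astype(np.float32)  # low-res latent
z_high = upsample_latent(z_low)                            # shape (4, 8, 32, 32)
```

Because the low-resolution stage has 4× fewer spatial tokens per frame, the quadratic attention cost in those early steps shrinks by roughly 16×, which is where much of the end-to-end speedup comes from.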