🤖 AI Summary
Video diffusion Transformers (DiTs) achieve high generation quality but suffer from low inference efficiency due to the quadratic complexity of self-attention and the multi-step diffusion process. To address this, we propose a lightweight, training-free inference acceleration framework. First, we design a dynamic block-wise attention mechanism guided by a 3D Z-order space-filling curve to enable spatially aware attention pruning. Second, we introduce a progressive latent resolution upsampling strategy: global structure is modeled rapidly at low resolution, while local details are refined at high resolution, aligning naturally with sparse attention patterns. Evaluated on VBench, our method achieves an 8.83× speedup with only a 0.01% performance drop. Inference time falls from minutes to seconds, enabling plug-and-play deployment without architectural changes or retraining.
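To make the Z-order idea concrete, here is a minimal, illustrative sketch of 3D Morton (Z-order) encoding: interleaving the bits of (t, h, w) coordinates linearizes latent tokens so that contiguous blocks of the sequence cover compact 3D regions, which is what makes block-wise attention spatially coherent. The function names `interleave_bits` and `morton3d` are our own for illustration and are not from the Jenga codebase.

```python
def interleave_bits(x: int, bits: int) -> int:
    """Spread the low `bits` bits of x so consecutive bits land 3 positions apart."""
    out = 0
    for i in range(bits):
        out |= ((x >> i) & 1) << (3 * i)
    return out

def morton3d(t: int, h: int, w: int, bits: int = 4) -> int:
    """Z-order (Morton) index for a 3D latent-token coordinate (t, h, w)."""
    return (interleave_bits(t, bits) << 2) | (interleave_bits(h, bits) << 1) | interleave_bits(w, bits)

# Sorting token coordinates by their Morton code reorders the flat token
# sequence so that each contiguous chunk maps to a compact 3D sub-volume.
coords = [(t, h, w) for t in range(4) for h in range(4) for w in range(4)]
order = sorted(coords, key=lambda c: morton3d(*c))
```

For example, the first 8 tokens in `order` form exactly the 2×2×2 corner cube of the latent volume, so a block of 8 consecutive tokens attends over a spatially local neighborhood rather than a scattered set of positions.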
📝 Abstract
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
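The progressive-resolution insight can be sketched as a two-stage schedule: run early denoising steps on a low-resolution latent, then upsample it and let later steps refine high-frequency detail. The helper below is a hypothetical NumPy illustration using nearest-neighbor upsampling on a (C, T, H, W) latent; Jenga's actual interpolation and stage schedule may differ.

```python
import numpy as np

def upsample_latent(z: np.ndarray, scale: int = 2) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, T, H, W) latent along H and W.

    Illustrative only: the real pipeline may use a different interpolation
    scheme or also change the temporal resolution.
    """
    return z.repeat(scale, axis=2).repeat(scale, axis=3)

# Hypothetical two-stage schedule: early steps denoise the low-res latent
# (global structure), then the latent is upsampled and later steps refine
# local detail at full resolution with sparse attention.
z_low = np.random.randn(4, 8, 16, 16).astype(np.float32)  # low-res latent
z_high = upsample_latent(z_low)                            # shape (4, 8, 32, 32)
```

Because the low-resolution stage has 4× fewer spatial tokens per frame, the quadratic attention cost in those early steps shrinks by roughly 16×, which is where much of the end-to-end speedup comes from.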