Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

📅 2025-02-03
🤖 AI Summary
Video diffusion Transformers suffer from prohibitively slow inference due to the quadratic computational complexity of 3D full attention, which severely hinders practical deployment. This paper proposes a training-free, dynamic sparsity-based acceleration framework that, for the first time, identifies and exploits spatial-temporal bimodal sparsity in video diffusion Transformers. Specifically, it performs online profiling to dynamically classify attention heads as spatial-dominant or temporal-dominant, enabling structured sparsification of the 3D attention computation. A hardware-friendly sparse tensor layout and custom CUDA kernels further maximize throughput. Evaluated on CogVideoX-v1.5 and HunyuanVideo, the method achieves end-to-end speedups of up to 2.28× and 2.33×, respectively, without compromising generation quality. Core contributions: (i) the novel conceptualization of spatial-temporal bimodal sparsity; (ii) an online, dynamic head-classification mechanism; and (iii) an efficient, end-to-end sparse execution stack.
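The two sparse patterns in the summary can be made concrete with boolean attention masks over a frame-major token sequence. The following is a minimal sketch, assuming a simple frame-major token ordering; the frame count, tokens-per-frame, and mask construction are illustrative, not the paper's exact implementation:

```python
import numpy as np

def spatial_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Spatial head: each query attends only to tokens in its own frame,
    i.e. a block-diagonal mask over the frame-major token sequence."""
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for f in range(num_frames):
        s = f * tokens_per_frame
        mask[s:s + tokens_per_frame, s:s + tokens_per_frame] = True
    return mask

def temporal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Temporal head: each query attends only to the token at the same
    spatial location in every frame, i.e. a strided (diagonal-line) mask."""
    n = num_frames * tokens_per_frame
    idx = np.arange(n)
    # Two tokens share a spatial location iff their within-frame index matches.
    return (idx[:, None] % tokens_per_frame) == (idx[None, :] % tokens_per_frame)

mask_s = spatial_mask(num_frames=4, tokens_per_frame=3)
mask_t = temporal_mask(num_frames=4, tokens_per_frame=3)
# Each mask keeps far fewer than n^2 entries: the spatial pattern has
# density 1/num_frames, the temporal pattern 1/tokens_per_frame.
density_s = mask_s.mean()   # 0.25 for 4 frames
density_t = mask_t.mean()   # 1/3 for 3 tokens per frame
```

Either pattern reduces attention from quadratic in the full context length to quadratic in only one of the two factors, which is where the speedup comes from.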

📝 Abstract
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that attention heads can be dynamically classified into two groups according to their distinct sparse patterns: (1) Spatial Heads, where only spatially related tokens within each frame dominate the attention output, and (2) Temporal Heads, where only temporally related tokens across different frames dominate. Based on this insight, SVG employs an online profiling strategy to capture the dynamic sparse patterns and predict the type of each attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
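The online profiling step can be sketched as follows: for each head, compute attention on a few sampled query rows under both candidate sparse patterns, and label the head with whichever pattern better approximates full attention. The sampling scheme, error metric, and the toy data below are illustrative assumptions, not SVG's exact procedure:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def classify_head(q, k, v, spatial, temporal, num_samples=4, seed=0):
    """q, k, v: (n, d) for one head; spatial/temporal: (n, n) boolean masks."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(q.shape[0], size=num_samples, replace=False)
    scores = q[rows] @ k.T / np.sqrt(q.shape[1])        # (num_samples, n)
    full = softmax(scores) @ v                          # reference output

    def masked_out(mask):
        return softmax(np.where(mask[rows], scores, -np.inf)) @ v

    err_s = np.linalg.norm(masked_out(spatial) - full)
    err_t = np.linalg.norm(masked_out(temporal) - full)
    return "spatial" if err_s <= err_t else "temporal"

def frame_masks(F, T):
    fid = np.arange(F * T) // T
    pos = np.arange(F * T) % T
    return fid[:, None] == fid[None, :], pos[:, None] == pos[None, :]

# Toy demo with 4 frames of 3 tokens each.
F, T = 4, 3
spatial, temporal = frame_masks(F, T)
v = np.random.default_rng(1).standard_normal((F * T, 4))

# Queries/keys that cluster by frame -> the head should read as spatial.
q_s = np.repeat(10.0 * np.eye(F), T, axis=0)            # (12, 4)
label_s = classify_head(q_s, q_s, v, spatial, temporal)

# Queries/keys that cluster by spatial location -> temporal.
q_t = np.tile(10.0 * np.eye(T, 4), (F, 1))              # (12, 4)
label_t = classify_head(q_t, q_t, v, spatial, temporal)
```

Because only a handful of rows are profiled, the classification overhead is small relative to the full attention it replaces, which is what makes doing it online at every step affordable.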
Problem

Research questions and friction points this paper is trying to address.

Reduce computational cost of video generation
Enhance efficiency of Diffusion Transformers
Leverage sparsity in 3D Full Attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages spatial-temporal sparsity
Dynamic attention head classification
Hardware-efficient tensor layout transformation
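The layout transformation listed above addresses a memory-access problem: in frame-major order, the tokens a temporal head gathers (same spatial location in every frame) are strided in memory. Permuting to a location-major order makes each such group contiguous, which is what an efficient sparse kernel wants. A minimal sketch, with shapes and names as illustrative assumptions rather than the paper's kernel interface:

```python
import numpy as np

def to_temporal_layout(x: np.ndarray, num_frames: int, tokens_per_frame: int):
    """Permute (num_frames*tokens_per_frame, d) tokens from frame-major
    to location-major order, so same-location tokens become contiguous."""
    d = x.shape[-1]
    return (x.reshape(num_frames, tokens_per_frame, d)
             .transpose(1, 0, 2)
             .reshape(tokens_per_frame * num_frames, d))

F, T, d = 4, 3, 2
x = np.arange(F * T * d, dtype=np.float32).reshape(F * T, d)
y = to_temporal_layout(x, F, T)
# Location 0's tokens sat at strided rows 0, 3, 6, 9; now they are rows 0..3.
```

After this permutation, a temporal head's sparse attention over each spatial location reduces to dense attention over a contiguous slice, so the custom kernels can use coalesced loads instead of gathers.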