Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

📅 2025-06-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video diffusion Transformers (vDiTs) suffer from slow inference in long-video generation due to the quadratic computational complexity of self-attention. This work first identifies strong structural patterns (diagonal, multi-diagonal, and vertical-stripe motifs) in vDiT attention maps that correlate with layer depth and head position while depending only weakly on input content. Leveraging this insight, the authors propose Sparse-vDiT, a hardware-aware sparse acceleration framework comprising: (i) pattern-customized sparse attention kernels, (ii) an offline, FLOP-aware sparse strategy search algorithm, and (iii) fusion of heads within a layer that share the same sparsity pattern. Evaluations on CogVideoX1.5, HunyuanVideo, and Wan2.1 demonstrate up to 2.38× theoretical FLOP reduction and 1.85× measured end-to-end inference speedup (both on HunyuanVideo) while preserving visual fidelity, with PSNR reaching 27.09.
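To make the three motifs concrete, here is a minimal PyTorch sketch of boolean block-level masks for each pattern. It is an illustration only: the block-grid size, diagonal width, offsets, and stripe columns are hypothetical parameters, and the actual method uses pattern-customized sparse kernels rather than masked dense attention.

```python
import torch

def diagonal_mask(n_blocks: int, width: int = 1) -> torch.Tensor:
    # Keep query/key blocks within `width` of the main diagonal (local attention).
    idx = torch.arange(n_blocks)
    return (idx[:, None] - idx[None, :]).abs() <= width

def multi_diagonal_mask(n_blocks: int, offsets, width: int = 1) -> torch.Tensor:
    # Union of several shifted diagonals, e.g. one per inter-frame stride.
    idx = torch.arange(n_blocks)
    diff = idx[:, None] - idx[None, :]
    mask = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    for off in offsets:
        mask |= (diff - off).abs() <= width
    return mask

def vertical_stripe_mask(n_blocks: int, stripe_cols) -> torch.Tensor:
    # Every query attends to a few fixed key columns (global "sink" blocks).
    mask = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    mask[:, list(stripe_cols)] = True
    return mask

# A 16x16 block grid mixing a local diagonal with two global stripes.
mask = diagonal_mask(16) | vertical_stripe_mask(16, stripe_cols=(0, 8))
print(f"kept blocks: {int(mask.sum())}/{16 * 16}")
```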

📝 Abstract
While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long-sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformers (vDiTs), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures; moreover, 3-6% of attention heads can be skipped entirely. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern; 2) an offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long-video synthesis.
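The offline search described above can be pictured as a per-head calibration loop: compare each candidate pattern's output against dense attention on calibration inputs and keep the cheapest pattern whose error stays under a tolerance. The sketch below is a rough PyTorch approximation of that loop; the tolerance, the relative-cost table, and all function names are assumptions, and the paper's algorithm uses a hardware-aware cost model rather than a fixed FLOP ratio.

```python
import torch

def attn(q, k, v, mask=None):
    # Scaled dot-product attention; `mask` keeps (True) or drops (False) scores.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

def search_head_strategy(q, k, v, patterns, rel_cost, tol=2e-2):
    # patterns: {name: (n, n) bool mask}; rel_cost: {name: FLOPs relative to dense}.
    dense = attn(q, k, v)
    best_name, best_cost = "dense", 1.0
    for name, mask in patterns.items():
        err = (attn(q, k, v, mask) - dense).norm() / dense.norm()
        if err < tol and rel_cost[name] < best_cost:
            best_name, best_cost = name, rel_cost[name]
    return best_name  # falls back to dense if no pattern is accurate enough

# Toy calibration pass for one head with a sliding-window (diagonal) candidate.
n, d = 256, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
idx = torch.arange(n)
diag = (idx[:, None] - idx[None, :]).abs() <= 16  # ~13% of scores kept
print(search_head_strategy(q, k, v, {"diagonal": diag}, {"diagonal": 0.13}))
```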
Problem

Research questions and friction points this paper is trying to address.

Reducing quadratic complexity of attention in video diffusion transformers
Identifying and leveraging sparsity patterns in attention maps
Accelerating inference while maintaining visual fidelity in video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pattern-optimized sparse kernels for efficient attention
Offline sparse diffusion search for optimal strategy
Head fusion within layers to boost efficiency (see the sketch below)
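As a rough illustration of the fusion step, the sketch below groups heads that the offline search assigned the same pattern and runs each group in one batched attention call. It is a minimal approximation assuming boolean masks and PyTorch's built-in scaled_dot_product_attention in place of the paper's custom sparse kernels; head_pattern and the grouping logic are hypothetical names.

```python
from collections import defaultdict
import torch
import torch.nn.functional as F

def fused_sparse_attention(q, k, v, head_pattern, masks):
    # q, k, v: (heads, seq, dim); head_pattern: per-head pattern name from search.
    groups = defaultdict(list)
    for h, name in enumerate(head_pattern):
        groups[name].append(h)
    out = torch.empty_like(q)
    for name, heads in groups.items():
        hs = torch.tensor(heads)
        # One batched call per pattern group: the mask (and kernel launch)
        # is shared by every head in the group.
        out[hs] = F.scaled_dot_product_attention(
            q[hs], k[hs], v[hs],
            attn_mask=masks[name] if name != "dense" else None,
        )
    return out

# Example: five heads share the diagonal pattern, three stay dense.
h, n, d = 8, 256, 64
q, k, v = (torch.randn(h, n, d) for _ in range(3))
idx = torch.arange(n)
masks = {"diagonal": (idx[:, None] - idx[None, :]).abs() <= 16}
out = fused_sparse_attention(q, k, v, ["diagonal"] * 5 + ["dense"] * 3, masks)
```

Fusing heads this way amortizes one mask and one kernel launch across the whole group, which is where the extra inference efficiency comes from.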
Pengtao Chen
Ph.D. Student, Fudan University
Computer Vision · Diffusion Model · Efficient Deep Learning
Xianfang Zeng
StepFun
Maosen Zhao
Fudan University
Peng Ye
The Chinese University of Hong Kong
Mingzhu Shen
Imperial College London
Wei Cheng
StepFun
Gang Yu
StepFun
Tao Chen
Fudan University