HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

📅 2026-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video diffusion models suffer from slow inference due to the quadratic complexity of full attention mechanisms, while existing training-free sparse attention methods struggle to balance efficiency and generation quality, often constrained by fixed thresholds and high masking overhead. This work proposes a training-free, head-adaptive sparse attention framework that introduces, for the first time, a head-level adaptive top-p sparsification mechanism. By integrating temporal mask reuse and error-guided global budget calibration, the method dynamically optimizes sparsity strategies across individual attention heads. Evaluated on the Video DiT architecture with Wan2.1-1.3B and Wan2.1-14B models, the approach achieves up to a 1.93× speedup in 720p video generation while preserving high video quality and similarity metrics.
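To make the idea of head-level top-p sparsification concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `top_p_keep` and `head_wise_keep`, the per-key (rather than per-block) granularity, and the example thresholds are all assumptions chosen for illustration. For each head, it keeps the smallest set of keys whose softmax mass reaches that head's own threshold p.

```python
import math

def top_p_keep(scores, p):
    """Return sorted indices of the smallest key set whose softmax mass reaches p.

    scores: raw attention logits of one query against all keys (hypothetical input).
    p: top-p threshold in (0, 1].
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Visit keys from most to least probable and stop once mass p is covered.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        if mass >= p:
            break
        kept.append(i)
        mass += probs[i]
    return sorted(kept)

def head_wise_keep(scores_per_head, p_per_head):
    """Apply a separate top-p threshold to each head instead of one shared value."""
    return [top_p_keep(s, p) for s, p in zip(scores_per_head, p_per_head)]

# Example: the same logits under two different per-head thresholds.
logits = [math.log(x) for x in (0.5, 0.3, 0.15, 0.05)]
kept = head_wise_keep([logits, logits], [0.4, 0.9])
```

With a shared threshold both heads would keep the same number of keys; with per-head thresholds, a head that tolerates more sparsity can drop more keys while a sensitive head keeps more.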
📝 Abstract
Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving a speedup of up to 1.93× at 720p while maintaining competitive video quality and similarity metrics.
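The two components named in the abstract can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the abstract does not specify the drift measure or the optimizer, so the relative L2 drift test in `should_reuse_mask`, the greedy error-per-density rule in `calibrate`, and all names and numbers here are assumptions.

```python
import math

def should_reuse_mask(prev_summary, cur_summary, tol):
    """Reuse the cached mask when a pooled query/key summary has drifted little.

    Assumed criterion: relative L2 distance between summaries is below tol.
    """
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev_summary, cur_summary)))
    den = math.sqrt(sum(a * a for a in prev_summary)) or 1.0
    return num / den < tol

def calibrate(errors, density, budget):
    """Pick one candidate threshold per head under a global density budget.

    errors[h][k]:  measured model-output error of head h at candidate k.
    density[h][k]: fraction of attention kept by head h at candidate k.
    Candidates are assumed sorted from sparsest (k=0) to densest.
    budget: maximum mean density across heads.
    Greedy rule (an assumption): repeatedly upgrade the head with the largest
    error reduction per unit of added density while the budget allows.
    """
    n_heads = len(errors)
    choice = [0] * n_heads
    used = sum(density[h][0] for h in range(n_heads))
    limit = budget * n_heads
    while True:
        best, best_gain = None, 0.0
        for h in range(n_heads):
            k = choice[h]
            if k + 1 >= len(errors[h]):
                continue  # head already at its densest candidate
            extra = density[h][k + 1] - density[h][k]
            drop = errors[h][k] - errors[h][k + 1]
            if extra <= 0 or used + extra > limit:
                continue  # upgrade is free of cost data or would exceed budget
            gain = drop / extra
            if gain > best_gain:
                best, best_gain = h, gain
        if best is None:
            return choice
        k = choice[best]
        used += density[best][k + 1] - density[best][k]
        choice[best] += 1

# Example: head 0 is far more sensitive to sparsity than head 1.
errors = [[0.9, 0.2, 0.1], [0.3, 0.25, 0.2]]
density = [[0.1, 0.3, 0.5], [0.1, 0.3, 0.5]]
picked = calibrate(errors, density, budget=0.35)
```

In this toy setting the budget is spent entirely on the sensitive head, which is the intuition behind assigning thresholds per head rather than sharing one value.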
Problem

Research questions and friction points this paper is trying to address.

sparse attention
video diffusion
training-free acceleration
attention head heterogeneity
mask prediction overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free acceleration
head-wise adaptive sparse attention
temporal mask reuse
error-guided calibration
video diffusion
Xuzhe Zheng
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Yuexiao Ma
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Jing Xu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning, Network Compression, Neural Architecture Search, AutoML
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Fei Chao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China