🤖 AI Summary
Video diffusion models are slow at inference because full attention scales quadratically with sequence length. Existing training-free sparse attention methods struggle to balance efficiency and generation quality: they apply shared top-p thresholds to attention heads that behave very differently, and they spend non-negligible time predicting masks. This work proposes a training-free, head-wise adaptive sparse attention framework with two plug-in components. Temporal Mask Reuse skips mask prediction when query-key drift is small, and Error-guided Budgeted Calibration assigns each head its own top-p threshold by minimizing measured model-output error under a global sparsity budget. Applied to XAttention and SVG2 on the Video DiT models Wan2.1-1.3B and Wan2.1-14B, the approach achieves up to a 1.93× speedup for 720p video generation while maintaining competitive video quality and similarity metrics.
📝 Abstract
Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still incurs a non-negligible mask-prediction cost and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to a 1.93× speedup at 720p while maintaining competitive video quality and similarity metrics.
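To make the two ideas in the abstract concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes that top-$p$ selection keeps the smallest set of key blocks whose softmax mass reaches a per-head threshold, and that mask reuse is gated by the relative L2 change of some pooled query-key summary vector. The names `top_p_mask`, `relative_drift`, `HeadMaskCache`, and `drift_tol`, as well as the drift metric itself, are hypothetical choices made for this example; the calibration step that picks each head's threshold is not shown.

```python
import math


def top_p_mask(block_scores, p):
    """Keep the smallest set of key blocks whose softmax mass reaches p.

    block_scores: raw (pre-softmax) importance scores, one per key block.
    Returns a list of booleans, True where the block is kept.
    """
    peak = max(block_scores)
    exps = [math.exp(s - peak) for s in block_scores]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit blocks from most to least probable, accumulating mass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    mass = 0.0
    for i in order:
        keep[i] = True
        mass += probs[i]
        if mass >= p:
            break
    return keep


def relative_drift(prev, curr):
    """Relative L2 change between two summary vectors (assumed drift metric)."""
    num = math.sqrt(sum((c - q) ** 2 for q, c in zip(prev, curr)))
    den = math.sqrt(sum(q * q for q in prev)) or 1.0
    return num / den


class HeadMaskCache:
    """Per-head mask cache: each head has its own top-p and reuses its mask
    across denoising steps while the query-key summary has drifted little."""

    def __init__(self, p, drift_tol):
        self.p = p                  # per-head top-p threshold (hypothetical value)
        self.drift_tol = drift_tol  # reuse the mask while drift stays below this
        self.ref_summary = None     # summary recorded at the last recomputation
        self.mask = None
        self.recomputed = 0         # how many times the mask was actually predicted

    def get_mask(self, qk_summary, compute_scores):
        """compute_scores is a zero-argument callable; it is only invoked
        when the mask has to be re-predicted, which is where time is saved."""
        if (self.mask is not None
                and relative_drift(self.ref_summary, qk_summary) < self.drift_tol):
            return self.mask
        self.mask = top_p_mask(compute_scores(), self.p)
        self.ref_summary = list(qk_summary)
        self.recomputed += 1
        return self.mask


if __name__ == "__main__":
    # Two heads with different thresholds see the same block scores.
    scores = [2.0, 0.0, 0.0, 0.0]
    print(top_p_mask(scores, 0.7))
    print(top_p_mask(scores, 0.9))
```

In this sketch drift is measured against the summary stored at the last recomputation rather than the previous step, so slow accumulated drift eventually triggers a fresh mask. A real system would operate on batched tensors on the GPU and would choose each head's `p` through calibration against model-output error.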