🤖 AI Summary
This work addresses the high computational cost of full spatiotemporal attention in diffusion Transformers for video generation, where existing block-sparse methods suffer significant quality degradation at high sparsity levels. The authors derive a theoretical lower bound on attention recall for block-sparse attention and use it to identify the factors that govern its effectiveness. Guided by this bound, they propose DFSAttn, a training-free framework for dynamic, fine-grained sparse attention that adapts to the changing structure of attention maps. It combines three designs: Hilbert curve-based token reordering, which gives fine-grained sparsity while keeping GPU execution efficient; hierarchical block importance scoring; and sparse mask caching with adaptive ratios. The method achieves up to 2.1× end-to-end inference speedup while maintaining high generation quality, and it consistently outperforms prior sparse attention methods under high sparsity.
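To make the notion of recall concrete, here is a minimal NumPy sketch. It assumes recall means the average fraction of softmax attention mass that falls inside the kept blocks, which is a common reading but is not confirmed by the summary. The block selector (`block_mask_topk`, mean-pooled queries and keys with top-k selection) is a hypothetical stand-in for illustration, not DFSAttn's hierarchical scoring.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_mask_topk(q, k, block, keep):
    """Hypothetical coarse selector: score block pairs with mean-pooled
    queries and keys, then keep the `keep` best key blocks per query block."""
    n, d = q.shape
    nb = n // block
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    scores = qb @ kb.T                            # (nb, nb) block-level scores
    top = np.argsort(-scores, axis=1)[:, :keep]   # best key blocks per row
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask

def attention_recall(q, k, block_mask, block):
    """Assumed definition: mean over queries of the softmax attention
    mass that lands inside the kept blocks."""
    n, d = q.shape
    probs = softmax(q @ k.T / np.sqrt(d))         # full attention, rows sum to 1
    token_mask = np.repeat(np.repeat(block_mask, block, axis=0), block, axis=1)
    return float((probs * token_mask).sum() / n)

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 16))
k = rng.standard_normal((64, 16))
for keep in (2, 4, 8):
    mask = block_mask_topk(q, k, block=8, keep=keep)
    print(keep, round(attention_recall(q, k, mask, block=8), 3))
```

Keeping all 8 key blocks gives a recall of 1 by construction; the question a recall bound answers is how much mass survives when far fewer blocks are kept.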
📝 Abstract
Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1× end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at https://github.com/jessica-hujie/DFSAttn.
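The intuition behind Hilbert curve-based reordering can be illustrated with a small sketch. It is a simplification under stated assumptions: a single 2D frame on a power-of-two grid rather than the full spatiotemporal volume, and the standard iterative Hilbert index computation. The abstract does not specify how DFSAttn applies the curve, so the helper names (`hilbert_index`, `hilbert_permutation`, `block_extent`) are illustrative only.

```python
import numpy as np

def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid
    (n must be a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_permutation(n):
    """perm[i] is the row-major token id of the i-th token along the curve."""
    ids = [(hilbert_index(n, x, y), y * n + x) for y in range(n) for x in range(n)]
    return np.array([tok for _, tok in sorted(ids)])

def block_extent(token_ids, n):
    """(height, width) of the bounding box covered by a set of tokens."""
    ys, xs = np.divmod(token_ids, n)
    return (int(ys.max() - ys.min() + 1), int(xs.max() - xs.min() + 1))

n, block = 8, 16
perm = hilbert_permutation(n)
row_major = np.arange(n * n)
print(block_extent(row_major[:block], n))  # (2, 8): a thin strip of two rows
print(block_extent(perm[:block], n))       # (4, 4): a compact square patch
```

With the same contiguous block size, the reordered sequence groups spatially close tokens into each block, so a block-level mask can follow local structure more closely while the kernel still operates on contiguous blocks.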