🤖 AI Summary
Existing sparse attention methods are typically model-specific: they lack cross-architecture generality and do not guarantee end-to-end performance. This paper proposes SpargeAttn, a general-purpose sparse and quantized attention mechanism built on a two-stage online filter. In the first stage, a fast predictor estimates the attention map so that matrix multiplications for near-zero regions can be skipped; in the second stage, an online softmax-aware filter skips further matrix multiplications at no extra overhead. SpargeAttn requires no model retraining and supports plug-and-play deployment across language, image, and video generation models. With no degradation in end-to-end metrics (e.g., BLEU, FID, LPIPS), it achieves up to 3.2× inference speedup. The implementation is open-sourced.
📝 Abstract
Efficient attention implementations are essential for large models because attention has quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, so the corresponding computations can be omitted. Many studies have exploited this sparsity to accelerate attention. However, most existing works optimize attention within specific models by exploiting particular sparse patterns of the attention map. A universal sparse attention that guarantees both speedup and end-to-end performance across diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter. In the first stage, we rapidly and accurately predict the attention map, which allows some matrix multiplications in attention to be skipped. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and skips further matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation models, without sacrificing end-to-end metrics. The code is available at https://github.com/thu-ml/SpargeAttn.
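
To make the two-stage idea concrete, here is a minimal NumPy sketch of block-sparse attention with an online softmax. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mean-pooling predictor, and the parameters `block`, `tau`, and `lam` are all hypothetical choices made for this example, and quantization is left out entirely. Stage 1 scores block pairs from pooled queries and keys and keeps the most probable key blocks until their estimated mass reaches `tau`. Stage 2 reuses the running row maximum that online softmax already tracks, and skips the `P @ V` product when a block's scores fall more than `lam` below it.

```python
import numpy as np

def dense_attention(q, k, v):
    """Reference single-head, non-causal softmax attention."""
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def predict_block_mask(q, k, block, tau):
    """Stage 1 (assumed heuristic): estimate block importance from mean-pooled Q/K."""
    n, d = q.shape
    nb = n // block  # assumes n is divisible by block
    q_pooled = q.reshape(nb, block, d).mean(axis=1)
    k_pooled = k.reshape(nb, block, d).mean(axis=1)
    scores = q_pooled @ k_pooled.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Keep the most probable key blocks until the mass kept so far reaches tau.
    order = np.argsort(-probs, axis=1)
    sorted_p = np.take_along_axis(probs, order, axis=1)
    mass_before = np.cumsum(sorted_p, axis=1) - sorted_p
    keep_sorted = mass_before < tau  # the top block is always kept for tau > 0
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=1)
    return mask

def sparse_attention(q, k, v, block=4, tau=0.9, lam=10.0):
    """Two-stage sketch: returns (output, blocks skipped in stage 1, in stage 2)."""
    n, d = q.shape
    nb = n // block
    mask = predict_block_mask(q, k, block, tau)
    out = np.zeros(v.shape, dtype=np.float64)
    skipped1 = skipped2 = 0
    for i in range(nb):
        qi = q[i * block:(i + 1) * block]
        m = np.full(block, -np.inf)          # running row maximum
        l = np.zeros(block)                  # running softmax denominator
        acc = np.zeros((block, v.shape[1]))  # running unnormalized output
        for j in range(nb):
            if not mask[i, j]:               # stage 1: skip both Q K^T and P V
                skipped1 += 1
                continue
            s = qi @ k[j * block:(j + 1) * block].T / np.sqrt(d)
            m_local = s.max(axis=1)
            # Stage 2: every weight in this block is below exp(-lam) relative
            # to the running maximum, so the P V product is skipped.
            if np.all(m_local - m < -lam):
                skipped2 += 1
                continue
            m_new = np.maximum(m, m_local)
            scale = np.exp(m - m_new)        # rescale earlier partial sums
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v[j * block:(j + 1) * block]
            m = m_new
        out[i * block:(i + 1) * block] = acc / l[:, None]
    return out, skipped1, skipped2
```

With `tau` set above 1 and `lam` set to infinity, no block is skipped and the sketch reduces to ordinary blockwise attention, which gives a simple way to check it against `dense_attention`. Lowering `tau` or `lam` trades accuracy for skipped work; how the actual method chooses its thresholds is not described in the abstract.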