SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sparse attention methods are typically model-specific, lacking cross-architecture generality and end-to-end performance guarantees. This paper proposes SpargeAttn, a general-purpose, quantization-aware sparse attention mechanism built on a two-stage online filter. In the first stage, a lightweight predictor rapidly approximates the attention map and skips matrix multiplications at positions predicted to be near zero; in the second stage, an online softmax-aware filter skips further multiplications at no extra overhead. The kernels additionally use low-bit quantization for hardware-friendly execution. SpargeAttn requires no model retraining and supports plug-and-play deployment across large models spanning language, image, and video architectures. It reports up to 3.2× inference speedup with no degradation in end-to-end metrics (e.g., BLEU, FID, LPIPS). The implementation is open-sourced.

📝 Abstract
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.
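To make the two-stage idea concrete, below is a minimal NumPy sketch of block-sparse attention in the spirit of the abstract: stage 1 predicts relevant key/value blocks per query block from mean-pooled block similarities and keeps only enough blocks to cover a cumulative softmax mass `tau1`; stage 2 runs an online softmax over the kept blocks and skips any block whose maximum score is so far below the running maximum that its softmax contribution falls under `tau2`. The mean-pooling predictor and the thresholds `tau1`/`tau2` are illustrative assumptions, not the paper's exact criteria, and the paper's quantized kernels are omitted.

```python
import numpy as np

def sparge_like_attention(Q, K, V, block=16, tau1=0.9, tau2=1e-4):
    """Illustrative two-stage block-sparse attention (not the authors' kernel).

    Stage 1: keep, per query block, the key blocks covering softmax mass tau1
    of a coarse attention map computed on mean-pooled blocks.
    Stage 2: during the online-softmax pass, skip a block whose scores are so
    far below the running maximum that exp(s_max - m) < tau2.
    Assumes n is divisible by `block`; tau1 >= 1 and tau2 = 0 recover exact
    attention.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    nb = n // block
    # Mean-pool each query/key block to predict the attention map cheaply.
    Qp = Q.reshape(nb, block, d).mean(axis=1)
    Kp = K.reshape(nb, block, d).mean(axis=1)
    S = Qp @ Kp.T * scale
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)

    out = np.empty_like(Q)
    for i in range(nb):
        # Stage 1: keep the top blocks covering cumulative mass tau1.
        order = np.argsort(-P[i])
        k = np.searchsorted(np.cumsum(P[i][order]), tau1) + 1
        keep = np.sort(order[:k])

        qi = Q[i * block:(i + 1) * block]
        m = np.full(block, -np.inf)       # running max (online softmax)
        l = np.zeros(block)               # running normalizer
        acc = np.zeros((block, d))        # running weighted sum of V
        for j in keep:
            s = qi @ K[j * block:(j + 1) * block].T * scale
            smax = s.max(axis=1)
            # Stage 2: softmax-aware skip -- block contributes ~nothing.
            if np.all(np.exp(smax - m) < tau2):
                continue
            m_new = np.maximum(m, smax)
            p = np.exp(s - m_new[:, None])
            alpha = np.exp(m - m_new)     # rescale previous accumulators
            l = l * alpha + p.sum(axis=1)
            acc = acc * alpha[:, None] + p @ V[j * block:(j + 1) * block]
            m = m_new
        out[i * block:(i + 1) * block] = acc / l[:, None]
    return out
```

A handy sanity check of this sketch: with `tau1` above 1 and `tau2 = 0` no block is ever skipped, so the output matches dense softmax attention exactly; lowering the thresholds then trades accuracy for skipped matrix multiplications.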
Problem

Research questions and friction points this paper is trying to address.

Existing sparse attention is tailored to specific models and attention patterns, lacking cross-architecture generality
Accelerating inference without sacrificing end-to-end model performance
Predicting attention sparsity online, accurately and with negligible overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

A universal sparse attention method applicable to diverse models without retraining
A two-stage online filter: fast attention-map prediction plus an overhead-free softmax-aware skip
Integration with quantized attention for additional speedup