ProxyAttn: Guided Sparse Attention via Representative Heads

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the quadratic computational complexity $O(n^2)$ of attention in large language models (LLMs) and the performance degradation of existing dynamic block-sparse methods under high sparsity—caused by coarse-grained importance estimation—this paper proposes a **training-free, fine-grained sparse attention mechanism**. The method leverages inter-head similarity to select representative attention heads as proxies for computing block-level importance scores, and integrates a dynamic, block-aware budget allocation strategy to enable more precise sparsity decisions. Crucially, it requires only forward-pass inference, introducing no additional parameters or training overhead. Evaluated across multiple mainstream LLMs and long-context benchmarks, the approach achieves up to 10.3× reduction in attention computation and 2.4× speedup in prefill latency, while preserving original model accuracy with negligible performance loss.
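The core idea described above can be illustrated with a minimal NumPy sketch, not the authors' implementation: a few pooled "representative" heads stand in for all heads when scoring key blocks. The function name, the grouping-by-index heuristic, and the max-pooling over blocks are illustrative assumptions.

```python
import numpy as np

def block_importance_via_proxy(q, k, num_rep=2, block=4):
    """Illustrative sketch: estimate block-level attention importance
    for all heads using a few mean-pooled proxy heads.
    q, k: (heads, seq, dim); seq must be divisible by `block`."""
    H, n, d = q.shape
    # Assumption: similar heads are adjacent; real grouping would cluster
    # heads by measured inter-head similarity.
    groups = np.array_split(np.arange(H), num_rep)
    scores = []
    for g in groups:
        qp, kp = q[g].mean(0), k[g].mean(0)       # pooled proxy head
        attn = qp @ kp.T / np.sqrt(d)             # (n, n) proxy scores
        # Max-pool the key dimension into blocks -> block importance.
        blk = attn.reshape(n, n // block, block).max(-1)
        scores.append(blk)
    return np.stack(scores)                       # (num_rep, n, n_blocks)
```

Each proxy head's block scores are then reused for every head in its group, so the full-head score matrix never has to be computed.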

📝 Abstract
The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.
Problem

Research questions and friction points this paper is trying to address.

Reducing quadratic complexity of attention in LLMs
Improving block importance estimation for sparse attention
Enhancing performance and efficiency in long-text tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compresses attention head dimensions for precise estimation
Uses representative proxy heads to approximate all heads
Employs block-aware dynamic budget for fine-grained evaluation
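The block-aware dynamic budget in the last bullet can be sketched as follows; this is a hedged illustration of one plausible allocation rule (keep the fewest blocks per row whose softmax mass reaches a threshold), with the function name and the `mass` parameter chosen here for illustration, not taken from the paper.

```python
import numpy as np

def dynamic_block_budget(block_scores, mass=0.9):
    """Illustrative sketch: per query row, find the smallest number of
    key blocks whose softmax probability mass reaches `mass`.
    block_scores: (rows, n_blocks) importance scores."""
    # Numerically stable softmax over blocks.
    p = np.exp(block_scores - block_scores.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    order = np.argsort(-p, -1)                    # blocks, most important first
    sorted_p = np.take_along_axis(p, order, -1)
    cum = np.cumsum(sorted_p, -1)
    budget = (cum < mass).sum(-1) + 1             # blocks needed per row
    return budget, order
```

Rows (or heads) with peaked attention get a small budget, while flatter rows keep more blocks, which matches the paper's observation that sparsity varies across heads.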