Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing sparse attention methods mitigate the quadratic cost of the prefill phase in long-context large language models by relying on predefined patterns or coarse approximations. These fail to capture real attention dynamics accurately, which hurts both accuracy and efficiency. This work rests on two observations: attention patterns show strong similarity across heads, and that similarity stays consistent across diverse inputs. Building on them, the authors propose a cross-head sparse pattern sharing mechanism that computes full attention for only a small subset of heads and reuses the resulting accurate patterns for the remaining heads. In the reported evaluations, the method achieves speedup comparable or superior to state-of-the-art sparse attention methods while delivering the best overall accuracy.

📝 Abstract

Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. Existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, so they often fail to fully capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across attention heads, our method effectively captures actual patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.
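The inter-head similarity observation can be made concrete with a small measurement. The sketch below is illustrative only and is not taken from the paper: it assumes that a head's sparse pattern is the set of its top-k keys per query and that similarity is the Jaccard overlap of two such sets. The helper names (`attention_probs`, `topk_pattern`, `jaccard`) and the random toy inputs are hypothetical.

```python
import math
import random

def attention_probs(Q, K):
    """Causal softmax attention probabilities for one head (one row per query)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    rows = []
    for i, q in enumerate(Q):
        # Causal mask: query i only sees keys 0..i.
        logits = [sum(a * b for a, b in zip(q, K[j])) * scale for j in range(i + 1)]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def topk_pattern(probs, k):
    """Assumed pattern definition: (query, key) pairs with the k largest weights per row."""
    pattern = set()
    for i, row in enumerate(probs):
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        pattern.update((i, j) for j in keep)
    return pattern

def jaccard(a, b):
    """Overlap of two patterns: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

if __name__ == "__main__":
    random.seed(0)
    n, d, k = 16, 8, 4
    rand = lambda: [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    Q_a, K_a = rand(), rand()
    # Head B is a slightly perturbed copy of head A; head C is unrelated.
    Q_b = [[x + 0.05 * random.gauss(0, 1) for x in row] for row in Q_a]
    Q_c, K_c = rand(), rand()
    p_a = topk_pattern(attention_probs(Q_a, K_a), k)
    p_b = topk_pattern(attention_probs(Q_b, K_a), k)
    p_c = topk_pattern(attention_probs(Q_c, K_c), k)
    print("A vs B:", jaccard(p_a, p_b))
    print("A vs C:", jaccard(p_a, p_c))
```

Repeating such a measurement over many different inputs would be one way to probe the second observation, that the similarity between a given pair of heads is stable rather than input-specific.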

Problem

Research questions and friction points this paper is trying to address.

Reducing quadratic complexity in long-context LLM prefilling

Improving accuracy of sparse attention pattern estimation

Sharing precise attention patterns across heads efficiently

Innovation

Methods, ideas, or system contributions that make the work stand out.

Shares precise attention patterns across heads

Reduces full attention computation significantly

Maintains high accuracy with improved speed
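The sharing idea listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one designated "pivot" head runs full causal attention, that the pattern is the top-k keys per query, and that every other head attends only to the pivot's selected keys. The function names, the `pivot` and `k` parameters, and the token-level (rather than block-level) pattern are all hypothetical choices; this summary does not describe how the paper selects heads or constructs patterns.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_head(Q, K, V):
    """Full causal attention for one head; returns outputs and probabilities."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    outs, probs = [], []
    for i, q in enumerate(Q):
        p = softmax([dot(q, K[j]) * scale for j in range(i + 1)])
        probs.append(p)
        outs.append([sum(p[j] * V[j][c] for j in range(i + 1)) for c in range(len(V[0]))])
    return outs, probs

def extract_pattern(probs, k):
    """Assumed pattern: for each query, sorted indices of its k highest-weight keys."""
    return [sorted(sorted(range(len(p)), key=lambda j: p[j], reverse=True)[:k])
            for p in probs]

def sparse_head(Q, K, V, pattern):
    """Attention restricted to the key indices in `pattern`; cost scales with k, not n."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    outs = []
    for q, keys in zip(Q, pattern):
        p = softmax([dot(q, K[j]) * scale for j in keys])
        outs.append([sum(w * V[j][c] for w, j in zip(p, keys)) for c in range(len(V[0]))])
    return outs

def shared_pattern_attention(heads, pivot, k):
    """heads: list of (Q, K, V) per head. Only the pivot head runs dense attention."""
    pivot_out, pivot_probs = dense_head(*heads[pivot])
    pattern = extract_pattern(pivot_probs, k)
    return [pivot_out if h == pivot else sparse_head(Q, K, V, pattern)
            for h, (Q, K, V) in enumerate(heads)]
```

In this sketch, setting `k` to the sequence length makes every non-pivot head reproduce dense attention exactly, which is a convenient sanity check. Smaller `k` trades accuracy for less computation, and how well that works depends on how similar the non-pivot heads really are to the pivot.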

🔎 Similar Papers

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention