🤖 AI Summary
Sparse attention methods reduce the quadratic cost of the prefill phase in long-context large language models, but existing approaches rely on predefined patterns or coarse estimates of attention. These fail to capture the actual attention dynamics, which hurts both accuracy and efficiency. This work observes and empirically validates that attention patterns are strongly similar across heads, and that this similarity stays consistent across diverse inputs. Building on this observation, the authors propose a cross-head sparse pattern sharing mechanism: full attention is computed for only a small subset of heads, and the accurate patterns obtained from those heads are shared with the remaining heads. In the reported evaluations, the method achieves speedups superior or comparable to state-of-the-art sparse attention methods while delivering the best overall accuracy.
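As a rough illustration of how similarity across attention heads could be quantified, the sketch below keeps the top fraction of keys per query for each head and compares the resulting masks with Jaccard overlap. The function names (`topk_mask`, `head_similarity`), the `keep_ratio` parameter, and the choice of Jaccard overlap are assumptions made for this example; the summary does not state which metric the authors use.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; rows containing -inf entries are handled correctly.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_mask(attn, keep_ratio):
    """Boolean mask marking the largest `keep_ratio` fraction of keys per query row."""
    n_keys = attn.shape[-1]
    k = max(1, int(round(keep_ratio * n_keys)))
    idx = np.argpartition(attn, -k, axis=-1)[..., -k:]
    mask = np.zeros(attn.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def jaccard(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def head_similarity(q, k, keep_ratio=0.25):
    """Pairwise Jaccard overlap of per-head top-k causal attention masks.

    q, k: arrays of shape (heads, seq_len, head_dim).
    Returns a (heads, heads) symmetric matrix with ones on the diagonal.
    """
    h, n, d = q.shape
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    attn = softmax(np.where(causal, scores, -np.inf))
    # Drop any selected positions that fall outside the causal region.
    masks = topk_mask(attn, keep_ratio) & causal
    sim = np.ones((h, h))
    for i in range(h):
        for j in range(i + 1, h):
            sim[i, j] = sim[j, i] = jaccard(masks[i], masks[j])
    return sim
```

A high off-diagonal value in the returned matrix would indicate that two heads attend to largely the same keys, which is the property the sharing mechanism depends on.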
📝 Abstract
Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. Existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, so they often fail to capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, capturing the dynamic behavior of attention more faithfully. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By sharing accurately computed patterns across attention heads, our method captures the actual attention patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.
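The sharing idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: a single head (`anchor`, a hypothetical parameter) gets full causal attention, its top `keep_ratio` keys per query form the pattern, and every other head reuses that pattern. How heads are actually grouped, how patterns are selected, and the block granularity are not specified in the abstract. For clarity the sketch still computes dense scores for the sparse heads and masks them afterwards; a real kernel would skip the masked positions to obtain the speedup.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis, restricted to positions where `mask` is True."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def shared_pattern_attention(q, k, v, anchor=0, keep_ratio=0.25):
    """Causal attention where non-anchor heads reuse the anchor head's sparse pattern.

    q, k, v: arrays of shape (heads, seq_len, head_dim).
    Returns (output, pattern), where pattern is a (seq_len, seq_len) boolean mask.
    """
    h, n, d = q.shape
    causal = np.tril(np.ones((n, n), dtype=bool))
    scale = 1.0 / np.sqrt(d)

    # Step 1: full attention for the anchor head only.
    anchor_w = masked_softmax(q[anchor] @ k[anchor].T * scale, causal)

    # Step 2: derive the pattern from the anchor's largest weights per query.
    # The row maximum is always positive and causal, so no row ends up empty.
    top = max(1, int(round(keep_ratio * n)))
    idx = np.argpartition(anchor_w, -top, axis=-1)[:, -top:]
    pattern = np.zeros((n, n), dtype=bool)
    np.put_along_axis(pattern, idx, True, axis=-1)
    pattern &= causal

    # Step 3: the anchor keeps full attention; all other heads use the shared pattern.
    out = np.empty_like(v)
    for head in range(h):
        mask = causal if head == anchor else pattern
        w = masked_softmax(q[head] @ k[head].T * scale, mask)
        out[head] = w @ v[head]
    return out, pattern
```

With `keep_ratio=1.0` the pattern equals the full causal mask and the output matches dense attention, which gives a simple sanity check; lowering `keep_ratio` trades accuracy for sparsity in the non-anchor heads.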