Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dense attention in large language models incurs O(N²H) computational complexity, severely limiting training efficiency for long contexts; existing sparse attention methods struggle to balance efficiency and modeling capacity. This paper introduces SPAttention, the first framework for *principled structural sparsity*: it partitions multi-head attention by token distance across heads, enabling functional specialization and collaborative interaction among heads—replacing independent head computations with a structured inductive bias. This design ensures balanced computational load, eliminates redundancy, and unifies multi-head attention into a coherent collaborative process. Evaluated on the OLMoE model family, SPAttention achieves nearly 2× higher training throughput while matching or surpassing dense attention in performance, and consistently outperforms established baselines—including Longformer, Reformer, and BigBird—across multiple benchmarks.

📝 Abstract
The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(H \cdot N^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, reallocating computation from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and OLMoE-0.25B-1.75B model series demonstrates that SPAttention delivers an approximately two-fold increase in training throughput while performing on par with standard dense attention, even surpassing it on select key metrics, and consistently outperforms representative sparse attention methods, including Longformer, Reformer, and BigBird, across all evaluation metrics.
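The factor-$H$ reduction claimed above can be made explicit with a back-of-the-envelope count (a sketch assuming causal attention and bands that exactly partition the query-key pairs; $B_h$ is a label introduced here, not the paper's notation):

$$
\underbrace{H \cdot \frac{N(N+1)}{2}}_{\text{dense: every head scores every causal pair}}
\;\longrightarrow\;
\sum_{h=1}^{H} |B_h| \;=\; \frac{N(N+1)}{2},
$$

where $B_h$ is the set of causal query-key pairs whose distance falls in head $h$'s band. Because the bands are non-overlapping and jointly cover all distances, the per-layer total is a single $O(N^2)$ computation rather than $H$ of them.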
Problem

Research questions and friction points this paper is trying to address.

Resolving the quadratic complexity of standard attention mechanisms in LLMs
Eliminating computational redundancy across multiple attention heads
Achieving speed gains without sacrificing model performance quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partitions attention workload into non-overlapping distance bands
Assigns each head a unique segment of computational task
Transforms multi-head attention into collaborative single computation
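The partitioning idea in the points above can be sketched as per-head attention masks. This is a minimal illustration, not the paper's implementation: the `band_masks` helper is hypothetical, and it uses uniform band widths, whereas the paper balances the computational load across bands.

```python
import numpy as np

def band_masks(n: int, num_heads: int) -> np.ndarray:
    """Per-head causal masks that partition query-key distances into
    num_heads contiguous, non-overlapping bands (uniform widths here;
    the paper balances the per-band computational load)."""
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]      # dist[i, j] = i - j
    causal = dist >= 0                      # queries attend only to j <= i
    width = -(-n // num_heads)              # ceil(n / num_heads)
    return np.stack([
        (dist >= h * width) & (dist < (h + 1) * width) & causal
        for h in range(num_heads)
    ])                                      # shape: (num_heads, n, n)

masks = band_masks(8, 4)
# The bands tile the causal triangle: every pair (i, j) with j <= i is
# assigned to exactly one head, so total work is O(N^2), not O(H * N^2).
assert (masks.sum(axis=0) == np.tril(np.ones((8, 8)))).all()
```

Each head would then apply its own band mask inside the softmax, so the union of heads still covers the full causal context while no query-key pair is scored twice.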
Authors

Mingkuan Zhao, Xi’an Jiaotong University
Wentao Hu, PhD student, The Hong Kong Polytechnic University (Large Language Model, Computer Vision)
Jiayin Wang, Tsinghua University (User Modeling, Personalization)
Xin Lai, ByteDance (Multimodal Understanding, Multimodal Agent)
Tianchen Huang, University of Science and Technology of China
Yuheng Min, Tsinghua University
Rui Yan, University of California, San Diego
Xiao-yi Zhu, Xi’an Jiaotong University