PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face dual bottlenecks—computational inefficiency and representational incompleteness—in long-context modeling due to the quadratic complexity of standard self-attention. To address this, we propose a novel sparse attention mechanism grounded in receptive field theory: hierarchical exponential skip connections enable the receptive field to grow exponentially (2ᵈ) across d layers, ensuring global, contiguous, and exponentially expanding coverage of input tokens for each output token. Our design unifies the efficiency of static sparsity with the expressiveness of dynamic sparsity, maintains architectural simplicity, and fully supports autoregressive decoding. Experiments demonstrate 5–40% accuracy gains on long-range benchmarks (e.g., Passkey, RULER); at 128K context length, our method achieves 3.0× speedup over full attention in both prefilling and decoding—approaching the linear complexity of sliding-window attention.
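The hierarchical exponential skip connections described above can be sketched as a static attention mask. The following Python/NumPy illustration is a hypothetical sketch, not the paper's implementation: the function names and the exact offset pattern (each token attending to itself and to tokens at power-of-two distances) are assumptions based on the summary's description of exponentially growing receptive fields.

```python
import numpy as np

def power_attention_mask(seq_len: int, max_exp: int) -> np.ndarray:
    """Boolean causal mask: token i attends to itself and to tokens
    at power-of-two offsets i - 2**k for k = 0..max_exp (assumed pattern)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, i] = True               # self-attention
        for k in range(max_exp + 1):
            j = i - 2 ** k              # exponential skip connection
            if j >= 0:
                mask[i, j] = True
    return mask

def receptive_field(mask: np.ndarray, layers: int) -> np.ndarray:
    """Positions reachable after stacking `layers` identical sparse
    attention layers (boolean matrix power of the attention graph)."""
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(layers):
        reach = (reach.astype(int) @ mask.astype(int)) > 0
    return reach

mask = power_attention_mask(64, max_exp=5)
# Each row has only O(log n) nonzeros, yet a small stack of layers
# lets the last token reach every earlier position via chained skips.
rf = receptive_field(mask, layers=6)
```

Under this assumed pattern, per-layer cost is O(n log n) rather than O(n²), and coverage of the full prefix follows because any distance d can be decomposed into at most popcount(d) power-of-two hops, so d layers suffice to reach roughly 2ᵈ tokens.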

📝 Abstract
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from incomplete effective context and/or require complex pipeline implementations. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the perspective of the receptive field, recognize the suboptimal nature of existing methods for expanding the receptive field, and introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension grounded in theoretical analysis. PowerAttention achieves exponential receptive field growth in $d$-layer LLMs, allowing each output token to attend to $2^d$ tokens, ensuring completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$, especially on tasks demanding long-range dependencies like Passkey Retrieval and RULER, while maintaining a time complexity comparable to sliding window attention. Efficiency evaluations further highlight PowerAttention's superior speedup in both prefilling and decoding phases compared with dynamic sparse attention methods and full attention ($3.0\times$ faster on 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Addresses efficiency bottlenecks in LLMs due to quadratic attention complexity.
Introduces PowerAttention for exponential receptive field growth in LLMs.
Improves performance on tasks requiring long-range dependencies.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exponential receptive field growth in LLMs
Ensures completeness and continuity of context
Superior speedup in prefilling and decoding phases
Lida Chen
Doctor of Computer Science, Fudan University
Long Context Model
Dong Xu
Knowledge Works Research Laboratory, School of Computer Science and Technology, Fudan University, Shanghai, China
Chenxin An
The University of Hong Kong
Long-context LLMs
Xintao Wang
Knowledge Works Research Laboratory, School of Computer Science and Technology, Fudan University, Shanghai, China
Yikai Zhang
Fudan University
Natural Language Processing, Autonomous Agent
Jiangjie Chen
ByteDance Seed
NLP, Machine Reasoning, Large Language Models, Autonomous Agent
Zujie Liang
ByteDance
Large Language Models, Deep Learning, Vision & Language
Feng Wei
Assistant Professor, Orthopaedic Biomechanics Laboratories, Michigan State University
Injury Biomechanics, Forensic Biomechanics, Computational Biomechanics
Jiaqing Liang
Fudan University
Knowledge Graph, Deep Learning
Yanghua Xiao
Knowledge Works Research Laboratory, School of Computer Science and Technology, Fudan University, Shanghai, China
Wei Wang
Knowledge Works Research Laboratory, School of Computer Science and Technology, Fudan University, Shanghai, China