🤖 AI Summary
Large language models (LLMs) face dual bottlenecks—computational inefficiency and representational incompleteness—in long-context modeling due to the quadratic complexity of standard self-attention. To address this, we propose a novel sparse attention mechanism grounded in receptive field theory: hierarchical exponential skip connections enable the receptive field to grow exponentially (2ᵈ) across d layers, ensuring global, contiguous, and exponentially expanding coverage of input tokens for each output token. Our design unifies the efficiency of static sparsity with the expressiveness of dynamic sparsity, maintains architectural simplicity, and fully supports autoregressive decoding. Experiments demonstrate 5–40% accuracy gains on long-range benchmarks (e.g., Passkey, RULER); at 128K context length, our method achieves 3.0× speedup over full attention in both prefilling and decoding—approaching the linear complexity of sliding-window attention.
📝 Abstract
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from incomplete effective context and/or require complex implementation pipelines. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the perspective of the receptive field, identify the suboptimal nature of existing methods for expanding the receptive field, and introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension grounded in this theoretical analysis. PowerAttention achieves exponential receptive field growth in $d$-layer LLMs, allowing each output token to attend to $2^d$ tokens and ensuring completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$, especially on tasks demanding long-range dependencies such as Passkey Retrieval and RULER, while maintaining a time complexity comparable to sliding window attention. Efficiency evaluations further highlight PowerAttention's superior speedup in both the prefilling and decoding phases compared with dynamic sparse attention and full attention ($3.0\times$ faster at 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
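The exponential receptive-field growth can be illustrated with a small sketch. Assuming, as one illustrative reading of the design (the paper's exact sparsity layout may differ), that each token attends to itself and to earlier tokens at power-of-two offsets, composing the pattern over $d$ layers lets information from any of the preceding $2^d$ positions reach a token, since every offset decomposes into a sum of distinct powers of two. The function name `power_attention_mask` below is our own, not from the paper:

```python
import numpy as np

def power_attention_mask(seq_len: int) -> np.ndarray:
    """Boolean causal mask: mask[i, j] = True iff query i may attend to key j.

    Illustrative assumption: each token attends to itself and to earlier
    tokens at power-of-two offsets (1, 2, 4, ...). Each row then holds only
    O(log n) keys, so per-step decoding cost is O(log n) rather than O(n).
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, i] = True                # each token sees itself
        offset = 1
        while i - offset >= 0:
            mask[i, i - offset] = True   # skip connection at distance 2^k
            offset *= 2
    return mask

mask = power_attention_mask(8)
print(mask[7].astype(int))  # [0 0 0 1 0 1 1 1]: token 7 sees offsets 0, 1, 2, 4

# Reachability after d layers = d-fold boolean composition of the mask.
m = mask.astype(int)
reach = m                                # one layer
for _ in range(2):                       # two more hops -> three layers total
    reach = ((reach @ m) > 0).astype(int)
print(reach[7])  # [1 1 1 1 1 1 1 1]: after 3 layers, coverage of all 2^3 tokens
```

Note that the coverage is contiguous, not just exponential in extent: every position within the window is reachable (e.g. position 0 via the hop chain 7 → 3 → 1 → 0), which matches the completeness and continuity properties claimed in the abstract.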