Scaling Context Requires Rethinking Attention

📅 2025-07-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches to long-sequence modeling face a fundamental trade-off: standard Transformers incur prohibitive O(L²) computational cost, while linear and other sub-quadratic alternatives suffer from weak in-context learning. Method: This paper introduces Power Attention, a linear-cost attention mechanism whose hidden-state size can be tuned independently of the model's parameter count, implemented with deeply optimized fused GPU kernels that avoid memory and bandwidth bottlenecks. Contribution/Results: Power Attention achieves O(L) time complexity while outperforming both standard linear attention and exponential (softmax-based) attention on long-range in-context learning, and the experiments show it dominates both at long-context training, breaking the computational and memory bottlenecks of long-sequence training.
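The linear-cost claim can be sketched concretely. Assuming the attention score takes the powered form (q·k)^p (the degree-p form here is illustrative, not the paper's exact kernel), the quadratic computation is equivalent to linear attention over p-fold tensor-power features, which admits an O(L) recurrent evaluation:

```python
import numpy as np

def tensor_power(x, p):
    """p-fold flattened outer product, chosen so that
    tensor_power(q, p) @ tensor_power(k, p) == (q @ k) ** p."""
    out = x
    for _ in range(p - 1):
        out = np.outer(out, x).ravel()
    return out

def quadratic_power_attention(Q, K, V, p):
    # Naive O(L^2) form: causal, unnormalized, scores (q . k)**p.
    L = Q.shape[0]
    scores = (Q @ K.T) ** p
    return (scores * np.tril(np.ones((L, L)))) @ V

def linear_power_attention(Q, K, V, p):
    # O(L) recurrent form: the state S has shape (d**p, d_v),
    # so its size is tuned via p without changing Q/K/V parameter shapes.
    d, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d ** p, d_v))
    Y = np.empty_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(tensor_power(K[t], p), V[t])  # accumulate phi(k_t) v_t^T
        Y[t] = tensor_power(Q[t], p) @ S            # read out with phi(q_t)
    return Y

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 6, 4))  # L=6 tokens, head dim d=4
assert np.allclose(quadratic_power_attention(Q, K, V, 2),
                   linear_power_attention(Q, K, V, 2))
```

The two forms produce identical outputs; the recurrent one never materializes the L×L score matrix, which is the source of the O(L) cost per sequence.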

๐Ÿ“ Abstract
We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention which reduce the cost-per-token of a transformer impair in-context learning, and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention show that these models dominate both exponential attention and linear attention at long-context training.
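The "too expensive versus too inexpensive" framing can be made concrete with a back-of-envelope FLOP count for the score/state computation; the head dimension and sequence lengths below are illustrative, and the linear figure is for standard linear attention (power attention's state update scales with its chosen state size instead):

```python
# Rough per-sequence FLOP comparison for the score/state computation.
d = 64  # head dimension (illustrative)
for L in (2_048, 65_536):
    quadratic = L * L * d  # softmax attention scores: O(L^2 d)
    linear = L * d * d     # linear-attention state updates: O(L d^2)
    print(f"L={L:>6}: quadratic costs {quadratic // linear}x the linear form")
```

The gap grows as L/d, so at 64k context the quadratic form is three orders of magnitude more expensive per token, while the linear form spends so little per token that, the abstract argues, it under-uses the context.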
Problem

Research questions and friction points this paper is trying to address.

High cost of processing long sequences in transformers
Inadequate performance of sub-quadratic architectures for long sequences
Sliding window attention impairs in-context learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces power attention for linear-cost modeling
Develops GPU kernels for efficient operation
Dominates both exponential and linear attention at long-context training
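The "state size adjusted independently of parameters" point above can be quantified: if the feature map is a p-fold tensor power of a d-dimensional key (an illustrative reading of the mechanism, not the paper's exact kernel), the raw state dimension grows as d^p while the Q/K/V projection parameters are unchanged; and because a tensor power is symmetric, at most C(d+p-1, p) of its entries are distinct:

```python
import math

d = 64  # head dimension (illustrative)
for p in (1, 2, 3):
    naive = d ** p                        # raw p-fold tensor-power dimension
    symmetric = math.comb(d + p - 1, p)   # distinct entries under symmetry
    print(f"p={p}: state rows {naive:>7} (symmetric: {symmetric})")
```

This is the knob that lets a practitioner trade memory for context capacity without retraining a larger model's worth of parameters.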
🔎 Similar Papers