Critical attention scaling in long-context transformers

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Attention degradation in long-context Transformers—where attention scores converge to a uniform distribution (rank collapse) as context length $n$ increases, impairing representational capacity—is a critical issue. This work introduces an analytically tractable attention model and employs asymptotic analysis coupled with phase transition theory to rigorously establish the critical scaling threshold $\beta_n \asymp \log n$. This result provides the first theoretical foundation for logarithmic scaling strategies adopted empirically in systems such as YaRN and Qwen. Further analysis reveals three distinct regimes: sub-logarithmic scaling induces token collapse toward a single direction; super-logarithmic scaling degenerates attention into an identity mapping; only logarithmic rescaling preserves both sparsity and content adaptivity, thereby sustaining semantic discriminability in long-context settings.

📝 Abstract
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
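The three regimes described in the abstract can be illustrated numerically. The following is a minimal NumPy sketch, not the paper's analytical model: it applies a temperature-like factor $\beta$ to one query's attention scores over $n$ keys and compares constant, logarithmic, and linear scaling. The random scores and regime labels are illustrative assumptions only.

```python
import numpy as np

def scaled_attention(scores, beta):
    """Softmax over query-key scores rescaled by a factor beta."""
    z = beta * scores
    z = z - z.max()          # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Compare scaling regimes for one query over n keys with O(1) scores.
rng = np.random.default_rng(0)
n = 10_000
scores = rng.standard_normal(n)

for label, beta in [("constant (sub-log)", 1.0),
                    ("logarithmic", np.log(n)),
                    ("linear (super-log)", float(n))]:
    w = scaled_attention(scores, beta)
    # Small max weight  -> near-uniform attention (collapse regime).
    # Max weight near 1 -> near one-hot attention (identity-like regime).
    print(f"{label:20s} beta={beta:10.2f} max weight={w.max():.4f}")
```

Running this shows the qualitative dichotomy: with constant $\beta$ the largest weight is tiny (attention is nearly uniform), with $\beta \approx n$ almost all mass lands on a single key, and $\beta \asymp \log n$ sits between, keeping a few dominant but non-degenerate weights. The paper proves this collapse/identity phase transition rigorously in its tractable model; this toy only visualizes the effect of the scaling factor.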
Problem

Research questions and friction points this paper is trying to address.

Addresses attention score collapse in long-context transformers
Identifies critical scaling threshold for effective attention mechanisms
Explains why logarithmic scaling maintains sparse adaptive attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention scaling with polylogarithmic factor addresses rank-collapse
Critical scaling identified as logarithmic function of context length
Logarithmic scaling maintains sparse content-adaptive attention