🤖 AI Summary
Attention degradation in long-context Transformers, where attention scores converge to a uniform distribution (rank collapse) as the context length $n$ increases and impair representational capacity, is a critical issue. This work introduces an analytically tractable attention model and employs asymptotic analysis coupled with phase-transition theory to rigorously establish the critical scaling threshold $\beta_n \asymp \log n$. This result provides the first theoretical foundation for the logarithmic scaling strategies adopted empirically in systems such as YaRN and Qwen. Further analysis reveals three distinct regimes: sub-logarithmic scaling induces token collapse toward a single direction; super-logarithmic scaling degenerates attention into an identity mapping; only logarithmic rescaling preserves both sparsity and content adaptivity, thereby sustaining semantic discriminability in long-context settings.
📝 Abstract
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank collapse. While *attention scaling* effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking.
We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens onto a single direction, while excessive scaling reduces attention to the identity map, eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for the attention scaling used in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
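The three regimes can be illustrated with a toy softmax calculation. This is a hedged sketch, not the paper's actual model: we assume a single relevant token whose pre-softmax score exceeds that of $n-1$ distractors by one unit, and multiply all scores by a factor $\beta_n$ before the softmax. With $\beta_n = 1$ the relevant token's weight vanishes as $n$ grows (uniform collapse); with $\beta_n = \log n$ it stays bounded away from 0 and 1 (sparse but content-adaptive); with $\beta_n = (\log n)^2$ it tends to 1 (attention degenerates toward a hard, identity-like selection).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(n, beta):
    """Weight the softmax assigns to one token with a unit score
    advantage over n-1 distractors, after rescaling scores by beta.
    (Toy setup for illustration; not the paper's model.)"""
    logits = [beta * 1.0] + [0.0] * (n - 1)
    return softmax(logits)[0]

for n in (100, 10_000, 1_000_000):
    print(
        f"n={n:>9}",
        f"beta=1: {max_attention_weight(n, 1.0):.2e}",          # -> 0: uniform collapse
        f"beta=log n: {max_attention_weight(n, math.log(n)):.3f}",  # stays near 1/2
        f"beta=(log n)^2: {max_attention_weight(n, math.log(n)**2):.6f}",  # -> 1: identity-like
    )
```

With $\beta_n = \log n$ the relevant token's weight is exactly $n/(2n-1) \to 1/2$ in this toy setup, which is why the logarithmic rate is the one that keeps attention sparse without freezing it.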