🤖 AI Summary
Attention degradation in long-context Transformers, where attention scores converge to a uniform distribution (rank collapse) as the context length $n$ increases and impair representational capacity, is a critical issue. This work introduces an analytically tractable attention model and employs asymptotic analysis coupled with phase-transition theory to rigorously establish the critical scaling threshold $\beta_n \asymp \log n$. This result provides the first theoretical foundation for the logarithmic scaling strategies adopted empirically in systems such as YaRN and Qwen. Further analysis reveals three distinct regimes: sub-logarithmic scaling induces token collapse toward a single direction; super-logarithmic scaling degenerates attention into an identity mapping; only logarithmic rescaling preserves both sparsity and content adaptivity, thereby sustaining semantic discriminability in long-context settings.
📝 Abstract
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank collapse. While *attention scaling* effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking.
We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens onto a single direction, while excessive scaling reduces attention to the identity map, eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for the attention scaling used in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
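The three regimes can be illustrated with a toy softmax calculation. This is a hedged sketch, not the paper's actual model: we assume a single relevant token whose pre-softmax score exceeds that of $n-1$ distractors by one unit, and multiply all scores by a factor $\beta_n$ before the softmax. With $\beta_n = 1$ the relevant token's weight vanishes as $n$ grows (uniform collapse); with $\beta_n = \log n$ it stays bounded away from 0 and 1 (sparse but content-adaptive); with $\beta_n = (\log n)^2$ it tends to 1 (attention degenerates toward a hard, identity-like selection).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(n, beta):
    """Weight the softmax assigns to one token with a unit score
    advantage over n-1 distractors, after rescaling scores by beta.
    (Toy setup for illustration; not the paper's model.)"""
    logits = [beta * 1.0] + [0.0] * (n - 1)
    return softmax(logits)[0]

for n in (100, 10_000, 1_000_000):
    print(
        f"n={n:>9}",
        f"beta=1: {max_attention_weight(n, 1.0):.2e}",          # -> 0: uniform collapse
        f"beta=log n: {max_attention_weight(n, math.log(n)):.3f}",  # stays near 1/2
        f"beta=(log n)^2: {max_attention_weight(n, math.log(n)**2):.6f}",  # -> 1: identity-like
    )
```

With $\beta_n = \log n$ the relevant token's weight is exactly $n/(2n-1) \to 1/2$ in this toy setup, which is why the logarithmic rate is the one that keeps attention sparse without freezing it.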