A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of a unified theory of how the inverse temperature in long-context self-attention should scale with the context length $n$. The authors build their framework on the gap-counting function $N_n$ of each attention-score row, which counts how many competitors lie within a given gap of the row maximum. From it they define an upper-tail accumulation scale and prove that it is the critical inverse-temperature scale for softmax concentration: below this scale the top competitors remain unseparated, and above it the attention entropy collapses. Previously proposed scaling laws, from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$, arise as different choices of $N_n$. The framework covers attention-score families ranging from idealized theoretical models to more practical Transformers and gives practitioners a direct diagnostic.
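The entropy collapse mentioned in the summary can be observed numerically on a single score row. The following is a minimal sketch, not the authors' code: the function name `softmax_entropy` and the toy score row are hypothetical, chosen only to show how the Shannon entropy of a softmax row falls from $\log n$ toward zero as the inverse temperature $\beta$ grows.

```python
import math


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores) for one attention row.

    `scores` is a list of raw attention scores and `beta` is the inverse
    temperature. Both names are illustrative, not taken from the paper.
    """
    m = max(scores)
    # Subtract the row maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - m)) for s in scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    # Skip zero probabilities so that 0 * log(0) is treated as 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    row = [0.0, -0.5, -1.0, -2.0]  # toy scores with a unique maximum (assumption)
    for beta in (0.0, 1.0, 5.0, 50.0):
        print(beta, softmax_entropy(row, beta))
```

At $\beta = 0$ the row is uniform and the entropy equals $\log n$; for a row with a unique maximum, the entropy tends to zero as $\beta$ increases. How quickly $\beta$ must grow with $n$ to cross from one regime to the other is the question the paper studies.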
📝 Abstract
Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
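To make the counting step concrete, here is a minimal sketch of a gap-counting function for one attention row. The abstract does not give the formula for $N_n$, so the conventions below are assumptions: the function name `gap_count` is hypothetical, the maximal entry itself is excluded from the count, and a competitor is counted when its gap to the maximum is at most `t`. The upper-tail accumulation scale built from $N_n$ is not reproduced here because its definition is not stated in this summary.

```python
def gap_count(scores, t):
    """Count competitors whose score lies within gap `t` of the row maximum.

    Assumed conventions (not confirmed by the source): the first maximal
    entry is the "winner" and is not counted, and the comparison is
    non-strict (gap <= t).
    """
    m = max(scores)
    top = scores.index(m)  # index of the first maximal entry
    return sum(1 for j, s in enumerate(scores) if j != top and m - s <= t)


if __name__ == "__main__":
    # Toy row whose gaps to the maximum are 0.25, 0.5, 2.0 and 4.0.
    row = [4.0, 3.75, 3.5, 2.0, 0.0]
    print([gap_count(row, t) for t in (0.25, 0.5, 2.0, 4.0)])  # [1, 2, 3, 4]
```

A row where many competitors crowd near the maximum has a fast-growing count, which is the kind of structure the abstract ties to the required inverse-temperature scale.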
Problem

Research questions and friction points this paper is trying to address.

inverse temperature
self-attention
scaling law
context length
softmax concentration
Innovation

Methods, ideas, or system contributions that make the work stand out.

inverse temperature
self-attention
scaling law
softmax concentration
gap-counting function