🤖 AI Summary
This work addresses the lack of a unified theoretical understanding of how the inverse temperature should scale with context length $n$ in long-context self-attention. The authors propose a framework based on an upper-tail accumulation scale, derived from a gap-counting function $N_n$ that records, for each row of attention scores, how many competitors lie within a given gap of the row maximum. Through probabilistic concentration analysis and a study of attention entropy, they show that this scale is the critical inverse-temperature scale for softmax concentration: below it, the top competitors remain unseparated, whereas above it, the attention entropy collapses. Because the critical scale is determined by $N_n$, several previously proposed scaling laws, from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$, arise as cases with different gap-counting functions. The framework applies to idealized theoretical models as well as more practical Transformers, and yields a diagnostic that practitioners can apply directly to a family of attention scores.
📝 Abstract
Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
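The two quantities the abstract relies on, the count of competitors within a gap of the row maximum and the entropy of the softmax at a given inverse temperature, can be illustrated with a minimal sketch. This is not the authors' implementation. The helper names `gap_counts` and `softmax_entropy` are hypothetical, the i.i.d. Gaussian scores in the demo are an assumption chosen only for illustration, and the upper-tail accumulation scale itself is not computed because the abstract does not give its definition.

```python
import numpy as np


def gap_counts(scores, gaps):
    """Count competitors whose score lies within each gap of the row maximum.

    The maximizing entry itself is excluded; exact ties with the maximum
    count as competitors at gap 0. (Illustrative reading of N_n.)
    """
    scores = np.asarray(scores, dtype=float)
    # Sorted distances from the maximum; drop the argmax's own zero gap.
    deltas = np.sort(scores.max() - scores)[1:]
    return np.searchsorted(deltas, gaps, side="right")


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores), computed stably."""
    z = beta * np.asarray(scores, dtype=float)
    z = z - z.max()                      # shift for numerical stability
    logp = z - np.log(np.exp(z).sum())   # log-probabilities
    p = np.exp(logp)
    return float(-(p * logp).sum())


if __name__ == "__main__":
    # Assumption: one attention row with i.i.d. standard Gaussian scores.
    rng = np.random.default_rng(0)
    n = 4096
    row = rng.standard_normal(n)

    print("competitors within gaps 0.1, 0.5, 1.0:",
          gap_counts(row, [0.1, 0.5, 1.0]))

    # Compare the three candidate inverse-temperature laws from the abstract.
    log_n = np.log(n)
    for name, beta in [("(log n)^(1/2)", log_n ** 0.5),
                       ("log n", log_n),
                       ("(log n)^2", log_n ** 2)]:
        print(f"beta = {name}: entropy = {softmax_entropy(row, beta):.4f}")
```

Running the demo on rows drawn from a different score family changes how quickly the gap counts grow, which is the quantity the abstract says determines the appropriate scaling law.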