🤖 AI Summary
This work traces the structural origin of attention sinks at the first token in large language models. Value aggregation in self-attention produces a systematic variance discrepancy across token representations, which super neurons in the feed-forward network (FFN) layers drastically amplify; their channel-sparse down-projections give the first-token representation a dimension disparity that makes an attention sink necessary as a structural anchor. The authors validate this causal chain with two controlled interventions, modifying the attention mask to isolate the aggregation effect and amplifying the variance of targeted token representations, both of which reproduce attention sinks at arbitrary positions. As a proof of concept, they propose head-wise RMSNorm, an architectural modification that stabilizes value aggregation outputs during pre-training; restoring statistical parity across positions significantly accelerates convergence.
📝 Abstract
Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose head-wise RMSNorm, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
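The variance discrepancy and the normalization idea can be illustrated with a toy NumPy sketch. It is not the paper's implementation, and it rests on simplifying assumptions that the abstract does not state: value vectors are i.i.d. with unit variance, attention is uniform over the causal window, and head-wise RMSNorm is taken to mean RMS normalization applied independently to each head's aggregated output. The function names and the sizes `T`, `H`, `D` are hypothetical. Under these assumptions the first position copies a single value vector and keeps its variance, while position t averages t+1 vectors and its variance shrinks roughly as 1/(t+1).

```python
import numpy as np


def causal_uniform_attention(seq_len):
    """Causal attention weights with uniform scores (an assumption):
    row t averages equally over positions 0..t."""
    mask = np.tril(np.ones((seq_len, seq_len)))
    return mask / mask.sum(axis=-1, keepdims=True)


def aggregate_values(weights, values):
    """Value aggregation: weights (T, T) applied to values (T, H, D) -> (T, H, D)."""
    return np.einsum("ts,shd->thd", weights, values)


def head_wise_rmsnorm(x, gain=None, eps=1e-6):
    """Assumed form of head-wise RMSNorm: normalize each head's vector
    by its own root mean square over the last (head_dim) axis."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    y = x / rms
    return y if gain is None else y * gain


rng = np.random.default_rng(0)
T, H, D = 16, 4, 64  # hypothetical sequence length, heads, head dimension
values = rng.standard_normal((T, H, D))

out = aggregate_values(causal_uniform_attention(T), values)

# Per-position variance of the aggregated output: largest at position 0.
var_before = out.var(axis=(1, 2))

# After per-head normalization every (position, head) vector has RMS close to 1.
normed = head_wise_rmsnorm(out)
rms_after = np.sqrt((normed ** 2).mean(axis=-1))

print("variance at first / last position:", var_before[0], var_before[-1])
print("RMS range after normalization:", rms_after.min(), rms_after.max())
```

In a trained model the attention weights are learned rather than uniform, so the size of the gap will differ; the sketch only shows why a causal mask alone can set the first token statistically apart, and how a per-head normalization removes that gap in scale.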