🤖 AI Summary
This study investigates the co-occurrence of massive activations and attention sinks in Transformer language models, elucidating their causes and functional roles. Through ablation studies, cross-layer representation analysis, and attention-head tracing, the work establishes a clear functional distinction: massive activations encode global implicit parameters, while attention sinks impose local biases on attention. Crucially, the research demonstrates that the co-occurrence is not semantically necessary but is an artifact of the pre-norm architecture: removing pre-normalization decouples the two phenomena, showing that neither causes the other. These findings advance the understanding of internal dynamics in Transformers and highlight how strongly modern architectural choices shape the structure of internal representations.
📝 Abstract
We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
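The two phenomena defined above lend themselves to simple quantitative diagnostics. The sketch below is not from the paper: the threshold ratios and the median-based outlier criterion are illustrative assumptions, meant only to show how one might flag massive activations in a hidden-state matrix and attention sinks in a row-stochastic attention map.

```python
import numpy as np

def find_massive_activations(hidden, ratio=50.0):
    """Flag (token, channel) entries whose magnitude dwarfs the typical scale.

    hidden: array of shape [seq_len, d_model].
    ratio:  heuristic outlier threshold (an assumption, not the paper's
            criterion) relative to the median absolute activation.
    """
    mags = np.abs(hidden)
    typical = np.median(mags)
    tok, chan = np.where(mags > ratio * typical)
    return list(zip(tok.tolist(), chan.tolist()))

def find_attention_sinks(attn, ratio=5.0):
    """Flag key positions that absorb far more than a uniform share of attention.

    attn:  array of shape [seq_len, seq_len], rows are queries and each row
           sums to 1 (post-softmax weights).
    ratio: heuristic multiple of the uniform per-key mass (again an assumption).
    """
    received = attn.mean(axis=0)      # average mass each key position receives
    uniform = 1.0 / attn.shape[1]
    return np.where(received > ratio * uniform)[0].tolist()

# Synthetic demo: plant one extreme outlier and one sink column.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 16))
hidden[0, 3] = 1000.0                 # massive activation at token 0, channel 3
attn = np.full((8, 8), 0.2 / 7)       # each query spreads 0.2 over 7 keys...
attn[:, 0] = 0.8                      # ...and sinks 0.8 of its mass on key 0
print(find_massive_activations(hidden))
print(find_attention_sinks(attn))
```

In real use, `hidden` would be a layer's residual-stream output and `attn` a single head's softmax weights extracted from the model; per-head analysis is what reveals the short-range bias the abstract describes.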