🤖 AI Summary
This study investigates the co-occurrence of massive activations and attention sinks in Transformer language models, elucidating their causes and functional roles. Through ablation studies, cross-layer representation analysis, and attention-head tracing, the work establishes a clear functional distinction: massive activations encode global implicit parameters, while attention sinks impose local biases on attention. Crucially, the research demonstrates that the co-occurrence is not semantically necessary but is an artifact of the pre-norm architecture: removing pre-normalization decouples the two phenomena, showing that neither causes the other. These findings advance the understanding of internal dynamics in Transformers and highlight how strongly modern architectural choices shape the structure of internal representations.
📝 Abstract
We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
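The two phenomena defined above lend themselves to simple quantitative diagnostics. The sketch below is not from the paper: the threshold ratios and the median-based outlier criterion are illustrative assumptions, meant only to show how one might flag massive activations in a hidden-state matrix and attention sinks in a row-stochastic attention map.

```python
import numpy as np

def find_massive_activations(hidden, ratio=50.0):
    """Flag (token, channel) entries whose magnitude dwarfs the typical scale.

    hidden: array of shape [seq_len, d_model].
    ratio:  heuristic outlier threshold (an assumption, not the paper's
            criterion) relative to the median absolute activation.
    """
    mags = np.abs(hidden)
    typical = np.median(mags)
    tok, chan = np.where(mags > ratio * typical)
    return list(zip(tok.tolist(), chan.tolist()))

def find_attention_sinks(attn, ratio=5.0):
    """Flag key positions that absorb far more than a uniform share of attention.

    attn:  array of shape [seq_len, seq_len], rows are queries and each row
           sums to 1 (post-softmax weights).
    ratio: heuristic multiple of the uniform per-key mass (again an assumption).
    """
    received = attn.mean(axis=0)      # average mass each key position receives
    uniform = 1.0 / attn.shape[1]
    return np.where(received > ratio * uniform)[0].tolist()

# Synthetic demo: plant one extreme outlier and one sink column.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 16))
hidden[0, 3] = 1000.0                 # massive activation at token 0, channel 3
attn = np.full((8, 8), 0.2 / 7)       # each query spreads 0.2 over 7 keys...
attn[:, 0] = 0.8                      # ...and sinks 0.8 of its mass on key 0
print(find_massive_activations(hidden))
print(find_attention_sinks(attn))
```

In real use, `hidden` would be a layer's residual-stream output and `attn` a single head's softmax weights extracted from the model; per-head analysis is what reveals the short-range bias the abstract describes.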