When Attention Sink Emerges in Language Models: An Empirical View

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
📄 PDF
🤖 AI Summary
This work investigates the cause of the “attention sink” phenomenon in language models, where abnormally high attention is assigned to the first token despite its semantic irrelevance. The authors find that the sink stems, at least partially, from the inner dependence among attention scores induced by softmax normalization, so it behaves like a key bias rather than a marker of semantic focus. Through cross-scale model analysis, pre-training trajectory tracking, substitution of softmax with sigmoid attention, key-bias modeling, and systematic ablation studies, they establish empirically that attention sinks emerge across 100M–1B parameter models during pre-training and are modulated by optimization dynamics and data distribution. Crucially, when softmax is replaced with attention operations that drop normalization, attention sinks do not emerge in models up to 1B parameters. All code is open-sourced for reproducibility.
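
The cross-scale analysis boils down to measuring how much attention each head puts on the first token. Below is a minimal sketch of such a measurement, not the paper's released code: the model name, prompt, and the 0.3 "sink head" threshold are illustrative assumptions.

```python
# Sketch: estimate the attention mass each head assigns to token 0 (the "sink").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM that can return attention maps
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one [batch, heads, query, key] tensor per layer.
sink_threshold = 0.3  # assumed cutoff for calling a head a "sink head"
for layer_idx, attn in enumerate(out.attentions):
    # Average attention to token 0 over queries; skip the first query,
    # which can only attend to itself under the causal mask.
    to_first = attn[0, :, 1:, 0].mean(dim=-1)  # [heads]
    avg = float(to_first.mean())
    n_sink = int((to_first > sink_threshold).sum())
    print(f"layer {layer_idx:2d}: mean attention to token 0 = {avg:.2f}, "
          f"sink heads = {n_sink}/{attn.shape[1]}")
```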

📝 Abstract
Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.
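
The abstract's "key bias" reading can be illustrated with an attention layer that prepends one learnable key whose paired value is all zeros: whatever attention mass lands on it is stored without contributing to the output. The module below is a minimal sketch of that idea under assumed shapes, not the paper's implementation.

```python
# Sketch: softmax attention with an extra "bias" key whose value is zero,
# so attention absorbed by it does not affect the output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyBiasAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.key_bias = nn.Parameter(torch.zeros(1, 1, dim))  # the extra, sink-like key

    def forward(self, q, k, v):
        b = q.shape[0]
        k_ext = torch.cat([self.key_bias.expand(b, -1, -1), k], dim=1)  # prepend bias key
        v_ext = torch.cat([torch.zeros_like(v[:, :1]), v], dim=1)       # its value is zero
        scores = q @ k_ext.transpose(-2, -1) / q.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)
        # weights[..., 0] is the attention "parked" on the bias key.
        return weights @ v_ext, weights[..., 0]

attn = KeyBiasAttention(dim=8)
q = k = v = torch.randn(2, 5, 8)
out, absorbed = attn(q, k, v)
print(out.shape, float(absorbed.mean()))  # share of attention absorbed by the bias key
```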
Problem

Research questions and friction points this paper is trying to address.

Why do LMs assign disproportionate attention to the first, often semantically unimportant, token (attention sink)?
How do optimization, data distribution, loss function, and architecture choices during pre-training influence its emergence?
Can relaxing softmax normalization, e.g., with alternative attention operations, prevent attention sinks from forming?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tracks when and why attention sink emerges during LM pre-training, across model scales and inputs.
Replaces softmax with non-normalized alternatives such as sigmoid attention, after which sinks do not emerge in models up to 1B parameters (see the sketch after this list).
Links the sink position to the loss function and the data distribution.
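
A minimal sketch of the contrast, using toy tensors and a single head with the causal mask omitted: softmax couples a row of attention scores because they must sum to 1, whereas element-wise sigmoid squashes each score independently. This only illustrates the two operations, not the paper's result that the latter avoids sinks.

```python
# Sketch: softmax attention (normalized, coupled scores) vs. sigmoid attention
# (no normalization, independent scores).
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)   # each row sums to 1, so scores depend on one another
    return weights @ v

def sigmoid_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = torch.sigmoid(scores)       # each score squashed independently, no row-sum constraint
    return weights @ v

q = torch.randn(1, 5, 8)   # [batch, tokens, dim]
k = torch.randn(1, 5, 8)
v = torch.randn(1, 5, 8)
print(softmax_attention(q, k, v).shape, sigmoid_attention(q, k, v).shape)
```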