A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training

📅 2026-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the impact of attention and residual "sinks" (tokens or dimensions that consistently produce anomalously large activations) on the training stability of large language models. The authors propose an "outlier-driven rescaling" view, showing that such sinks act primarily as rescaling factors rather than substantive signal contributors, working in concert with normalization layers such as RMSNorm to stabilize training dynamics. Building on this insight, they design learnable sink-absorbing and gating-based rescaling strategies, further refined with outlier clipping. Evaluated across diverse architectures and training scales, the approach improves average training performance by 2 points and significantly enhances robustness under W4A4 quantization, incurring only a 1.2-point performance drop.
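The attention-sink side of this rescaling effect can be illustrated with a minimal sketch: a single token with a much larger attention logit absorbs most of the softmax probability mass, uniformly scaling down the weight on every other token. The logit values below are hypothetical, chosen only to make the effect visible; this is not the paper's code.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of attention logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Attention logits where token 0 is a "sink" receiving a much
# larger logit than the rest (hypothetical values).
with_sink = [5.0, 0.0, 0.0, 0.0]
no_sink = [0.0, 0.0, 0.0, 0.0]

# The sink absorbs most of the probability mass, which rescales
# (shrinks) the attention paid to every non-sink token:
print(softmax(with_sink)[1])  # small weight on non-sink tokens
print(softmax(no_sink)[1])    # 0.25: uniform attention without a sink
```

Because softmax normalizes weights to sum to one, the sink's oversized logit acts purely as a shared denominator for the other tokens, which is the rescaling (rather than contribution) role the paper attributes to it.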

📝 Abstract
We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (e.g., softmax attention and RMSNorm), effectively rescale the other, non-outlier components. We term this phenomenon *outlier-driven rescaling* and validate the hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations are: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance, while directly clipping outliers with normalization retained also leads to degradation, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve as rescaling factors rather than contributors: the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (an average gain of 2 points) and enhanced quantization robustness (only 1.2 points of degradation under W4A4 quantization).
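The residual-sink mechanism described above can be sketched in the same spirit: one dimension with a persistently large activation dominates the RMS denominator of RMSNorm, so the normalized values of all non-outlier dimensions shrink. The activation values are hypothetical illustrations, and the learnable gain of RMSNorm is omitted for brevity; this is not the paper's implementation.

```python
import math

def rms_norm(x, eps=1e-6):
    # RMSNorm without the learnable gain: x / sqrt(mean(x^2) + eps).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

# A hidden state whose first dimension is a "residual sink" with a
# far larger activation than the rest (hypothetical values).
with_sink = [100.0, 1.0, 1.0, 1.0]
without_sink = [1.0, 1.0, 1.0, 1.0]

# The sink inflates the RMS denominator, so the non-outlier
# dimensions are rescaled down by a large shared factor:
print(rms_norm(with_sink)[1])     # small: sink suppresses the others
print(rms_norm(without_sink)[1])  # ~1.0: no rescaling without the sink
```

As with the attention case, the sink contributes little signal of its own after normalization; its main effect is the shared scale factor it imposes on everything else, which is why absorbing it into learnable parameters or an explicit gate can replace it.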
Problem

Research questions and friction points this paper is trying to address.

attention sinks
residual sinks
outliers
Transformer training
normalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

outlier-driven rescaling
attention sinks
residual sinks
Transformer training
quantization robustness