Attention Sinks and Outliers in Attention Residuals

📅 2026-05-18

🤖 AI Summary
This work addresses the instability and degraded quantization robustness of AttnResidual architectures, which stem from excessive attention concentration (attention sinks) and activation outliers induced by their dual-normalization design. It proposes OASIS, presented as the first method to identify this mechanism, and introduces a null-aware inter-layer signal regulation scheme. By modeling a Softmax1-based null space, OASIS couples token-level null evidence with depth routing to suppress sink-dominated attention aggregation. Experiments show that OASIS reduces the maximum ℓ∞ norm by 9.26% and kurtosis by 2.60% on average across three datasets. Under W8A8 quantization it lowers perplexity by 75.85%, and under the more aggressive W4A4 setting it improves GSM8K Pass@1 accuracy by 12.42%.
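The outlier metrics quoted above can be illustrated with a small sketch. It assumes Pearson kurtosis (the summary does not say whether the paper reports Pearson or excess kurtosis) and symmetric per-tensor 8-bit quantization as a stand-in for the activation side of W8A8. The function names and the toy activation vectors are hypothetical and are not taken from the paper.

```python
import math

def linf_norm(xs):
    """Infinity norm: the largest absolute value in the activation vector."""
    return max(abs(x) for x in xs)

def kurtosis(xs):
    """Pearson kurtosis (fourth standardized moment); about 3 for a Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    fourth = sum((x - mean) ** 4 for x in xs) / n
    return fourth / (var ** 2)

def int8_roundtrip_error(xs):
    """Mean absolute error after symmetric per-tensor 8-bit quantization."""
    scale = linf_norm(xs) / 127.0  # one scale for the whole tensor
    dequantized = [round(x / scale) * scale for x in xs]
    return sum(abs(a - b) for a, b in zip(xs, dequantized)) / len(xs)

# Toy activations (illustrative only): a smooth vector, and the same
# vector with its last entry replaced by a single large outlier.
smooth = [math.sin(0.37 * i) for i in range(256)]
spiky = smooth[:-1] + [60.0]

for name, acts in (("smooth", smooth), ("spiky", spiky)):
    print(name, linf_norm(acts), kurtosis(acts), int8_roundtrip_error(acts))
```

A single outlier stretches the quantization scale, so every other value is rounded on a much coarser grid. This is why lower infinity norm and kurtosis tend to go together with better post-quantization accuracy.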
📝 Abstract
We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. Because AttnResidual architectures introduce an additional depth-wise normalization channel, they improve inter-layer routing flexibility but also exacerbate attention sinks and activation outliers, which in turn degrade inference stability and quantization robustness. OASIS addresses this issue by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention-sink metrics and post-quantization performance. Notably, OASIS achieves an average reduction of 9.26% in maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.
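As a rough illustration of what a Softmax1-based null space could look like, the sketch below assumes that Softmax1 refers to the commonly discussed "softmax plus one" variant, exp(s_i) / (1 + Σ_j exp(s_j)). The abstract does not spell out the definition, and the step that couples the null mass to depth routing is not described here, so it is not modelled. The names `softmax1` and `null_mass` are hypothetical.

```python
import math

def softmax(scores):
    """Standard softmax: weights always sum to 1, so some token must absorb the mass."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(scores):
    """Assumed Softmax1: exp(s_i) / (1 + sum_j exp(s_j)), computed stably.

    The extra 1 in the denominator behaves like an implicit zero-logit slot.
    Returns the token weights and the leftover "null" mass.
    """
    m = max(max(scores), 0.0)      # include the implicit zero logit in the shift
    exps = [math.exp(s - m) for s in scores]
    null = math.exp(-m)            # contribution of the implicit zero logit
    total = null + sum(exps)
    weights = [e / total for e in exps]
    null_mass = null / total
    return weights, null_mass

# A query that matches no key well: every score is strongly negative.
scores = [-4.0, -5.0, -6.0]
print(softmax(scores))    # still forced to distribute all of the mass
print(softmax1(scores))   # most of the mass falls into the null slot
```

Under this reading, a head with nothing relevant to attend to can place its weight in the null slot instead of on a sink token, and the size of `null_mass` is a per-token quantity that a later stage could consume.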
Problem

Research questions and friction points this paper is trying to address.

attention sinks
outliers
AttnResidual
quantization robustness
inference stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention sinks
activation outliers
inter-layer null signaling
AttnResidual
quantization robustness