AI Summary
This work addresses the instability and degraded quantization robustness of AttnResidual architectures, which stem from excessive attention concentration (attention sinks) and activation outliers induced by their dual-normalization design. To tackle this issue, we propose OASIS, which is the first to identify this underlying mechanism and introduces a null-aware inter-layer signal regulation scheme. By modeling the null space of Softmax1, OASIS couples token-level null evidence with depth routing to suppress sink-dominated attention aggregation. Experiments show that, averaged across three datasets, OASIS reduces the maximum ℓ∞ norm by 9.26% and kurtosis by 2.60%. Under W8A8 quantization it lowers perplexity by 75.85%, and under the more aggressive W4A4 setting it improves GSM8K Pass@1 accuracy by 12.42%, substantially improving architectural robustness.
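For readers unfamiliar with Softmax1: it adds 1 to the softmax denominator, which is equivalent to an extra key with a fixed logit of 0, so a query can leave part of its attention mass unassigned instead of forcing all of it onto some token (behavior often linked to attention sinks). The minimal stdlib Python sketch below compares it with standard softmax and returns that leftover mass. Treating this value as "null evidence" is our illustrative assumption, and the names `softmax1` and `null` are hypothetical; how OASIS aggregates this signal and couples it to depth routing is not reproduced here.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Standard softmax: the weights always sum to exactly 1."""
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax1(logits: List[float]) -> Tuple[List[float], float]:
    """Softmax1: w_i = exp(x_i) / (1 + sum_j exp(x_j)).

    Returns the weights and the leftover null mass 1 / (1 + sum_j exp(x_j)),
    which equals 1 - sum(weights). (Illustrative sketch, not OASIS itself.)
    """
    # The "+1" acts like an implicit key with logit 0, so shift every logit
    # (including that implicit 0) by max(max(logits), 0) to avoid overflow.
    m = max(max(logits), 0.0)
    exps = [math.exp(x - m) for x in logits]
    null_term = math.exp(-m)
    denom = null_term + sum(exps)
    return [e / denom for e in exps], null_term / denom


# Illustrative logits for one query over three keys.
for logits in ([-4.0, -5.0, -6.0], [10.0, 0.0, 0.0]):
    w, null = softmax1(logits)
    print("softmax :", [round(p, 4) for p in softmax(logits)])
    print("softmax1:", [round(p, 4) for p in w], "null mass:", round(null, 4))
```

With uniformly low logits, standard softmax still spreads all of its mass over the three keys, whereas Softmax1 sends almost all of it to the null slot; with one dominant logit, the two behave nearly identically.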
Abstract
We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. By adding a depth-wise normalization channel, AttnResidual architectures gain inter-layer routing flexibility, but they also exacerbate attention sinks and activation outliers, degrading inference stability and quantization robustness. OASIS addresses this by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention-sink metrics and post-quantization performance. Notably, OASIS achieves average reductions of 9.26% in the maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.
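The two outlier statistics quoted above, the maximum infinity norm (largest absolute activation) and kurtosis, and their link to quantization brittleness can be illustrated with a small stdlib Python sketch. It assumes symmetric per-tensor INT8 round-to-nearest quantization and Pearson (non-excess) kurtosis; the paper's exact metric definitions, quantizer, and W8A8/W4A4 pipeline are not given in this section, and the activations are synthetic. All function names are hypothetical.

```python
import math
from typing import List


def linf_norm(x: List[float]) -> float:
    """Infinity norm: the largest absolute activation."""
    return max(abs(v) for v in x)


def kurtosis(x: List[float]) -> float:
    """Pearson (non-excess) kurtosis E[(x - mu)^4] / var^2; 3.0 for a Gaussian."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    m4 = sum((v - mu) ** 4 for v in x) / n
    return m4 / var ** 2


def fake_quant_int8(x: List[float]) -> List[float]:
    """Symmetric per-tensor INT8 quantize-then-dequantize (assumed quantizer).

    One scale is shared by the whole tensor, so it is set by the largest |x|.
    Python's round() rounds half to even.
    """
    scale = linf_norm(x) / 127.0
    if scale == 0.0:
        return list(x)
    return [max(-127, min(127, round(v / scale))) * scale for v in x]


def mse(a: List[float], b: List[float]) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Synthetic activations: a smooth channel, then the same values plus one outlier.
base = [math.sin(0.37 * i) for i in range(256)]
spiky = base + [60.0]

for name, acts in (("base", base), ("spiky", spiky)):
    deq = fake_quant_int8(acts)
    err = mse(base, deq[:256])  # error on the 256 shared, non-outlier entries
    print(f"{name:5s} linf={linf_norm(acts):6.2f} "
          f"kurtosis={kurtosis(acts):8.2f} int8_mse={err:.2e}")
```

The single outlier raises both the infinity norm and kurtosis, and because it sets the shared scale, the quantization step for every other entry grows roughly 60-fold, inflating their rounding error. Suppressing such outliers at the source, as OASIS aims to do, shrinks the shared scale and with it that error.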