Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

📅 2025-05-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing studies lack a systematic analysis of how gating mechanisms enhance softmax attention. This work presents the first systematic investigation into the impact of gating position on attention performance. We propose a head-specific sigmoid gate applied to the softmax attention output, enabling query-dependent, non-linear, sparse modulation. This design effectively mitigates the "attention sink" problem. Evaluated on both a 15B MoE architecture and a 1.7B dense architecture trained on 3.5T tokens, our method significantly improves multi-task benchmark performance, training stability, learning-rate robustness, and long-context extrapolation. Ablation studies show that gating the attention output consistently outperforms variants that apply gating to the attention inputs or at intermediate positions in the attention computation. The approach is lightweight and architecture-agnostic, adding only the gating parameters on top of the standard attention block. Code and pre-trained models are publicly released on GitHub and Hugging Face.

๐Ÿ“ Abstract
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comparison of 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5-trillion-token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates the 'attention sink' problem and enhances long-context extrapolation performance. We release the related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research.
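
To make the described mechanism concrete, the sketch below shows, in PyTorch, a head-specific sigmoid gate applied elementwise to the SDPA output. This is a minimal illustration based on the abstract, not the authors' released implementation; the gate projection `w_gate`, the dimension names, and the single-module layout are assumptions.

```python
# Minimal sketch of output-gated softmax attention (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # Assumed gate projection: gate scores are computed from the same
        # pre-attention hidden state as the query, so they are query-dependent
        # and distinct for every head and position.
        self.w_gate = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Project and split into heads: (b, n_heads, t, d_head).
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Standard causal scaled dot-product attention.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Head-specific sigmoid gate applied elementwise to the SDPA output.
        gate = torch.sigmoid(self.w_gate(x)).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = gate * attn
        # Merge heads and project out.
        attn = attn.transpose(1, 2).reshape(b, t, d)
        return self.o_proj(attn)
```

Because the gate is derived from the same hidden state that produces the query, its scores vary per query token; when the sigmoid saturates near zero for a given head and position, that head's output is suppressed, which is the sparse modulation the paper associates with mitigating the attention sink.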
Problem

Research questions and friction points this paper is trying to address.

Investigates gating effects on softmax attention variants
Improves performance via head-specific sigmoid gating
Mitigates attention sink and enhances long-context extrapolation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Head-specific sigmoid gate after SDPA
Non-linearity in softmax attention
Query-dependent sparse gating scores