Softmax Linear Attention: Reclaiming Global Competition

πŸ“… 2026-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses a core limitation of linear attention mechanisms: by omitting softmax normalization, they lack a global competition mechanism and consequently struggle to focus on critical information amid noise in long contexts. The authors propose Softmax Linear Attention (SLA), which lifts the softmax operation from the token level to the attention-head level, treating the multiple heads as coarse-grained semantic slots. A dynamic competitive gating mechanism then restores "winner-takes-all" global competition while preserving linear computational complexity. In contrast to prior work that optimizes only local kernel functions, SLA exploits the higher-level multi-head structure, and it yields significant improvements over baselines such as RetNet, GLA, and GDN on language modeling and long-context benchmarks, with notably enhanced robustness in noisy retrieval tasks.

πŸ“ Abstract
While linear attention reduces the quadratic complexity of standard Transformers to linear time, it often lags behind in expressivity due to the removal of softmax normalization. This omission eliminates *global competition*, a critical mechanism that enables models to sharply focus on relevant information amidst long-context noise. In this work, we propose **Softmax Linear Attention (SLA)**, a framework designed to restore this competitive selection without sacrificing efficiency. By lifting the softmax operation from the token level to the head level, SLA leverages attention heads as coarse semantic slots, applying a competitive gating mechanism to dynamically select the most relevant subspaces. This reintroduces the "winner-take-all" dynamics essential for precise retrieval and robust long-context understanding. Distinct from prior methods that focus on refining local kernel functions, SLA adopts a broader perspective by exploiting the higher-level multi-head aggregation structure. Extensive experiments demonstrate that SLA consistently enhances state-of-the-art linear baselines (RetNet, GLA, GDN) across language modeling and long-context benchmarks, particularly in challenging retrieval scenarios where it significantly boosts robustness against noise, validating its capability to restore precise focus while maintaining linear complexity.
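The head-level competition described in the abstract can be sketched as follows. This is a minimal, non-causal illustration, not the paper's implementation: the elu+1 feature map, the gate projection `Wg`, and the per-token softmax over head gate logits are all assumptions chosen to make the idea concrete.

```python
import numpy as np

def linear_attention(q, k, v):
    """Kernelized (non-causal) linear attention for one head.

    Assumes an elu+1 feature map, a common choice in linear-attention work;
    the paper may use a different kernel.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # positive features
    q, k = phi(q), phi(k)
    kv = k.T @ v                      # (dh, dv) summary of keys/values
    z = q @ k.sum(axis=0)             # per-query normalizer, strictly positive
    return (q @ kv) / z[:, None]

def sla_forward(x, Wq, Wk, Wv, Wg, n_heads):
    """Sketch of head-level softmax gating over linear-attention heads.

    Each head runs ordinary linear attention (linear cost in sequence length);
    a softmax over per-token gate logits makes the heads compete, so one head
    can "win" for a given token -- the coarse winner-take-all dynamic.
    """
    T, d = x.shape
    dh = d // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        heads.append(linear_attention(q[:, s], k[:, s], v[:, s]))
    heads = np.stack(heads, axis=1)   # (T, H, dh)

    gate = x @ Wg                                        # (T, H) gate logits
    gate = np.exp(gate - gate.max(-1, keepdims=True))    # stable softmax
    gate = gate / gate.sum(-1, keepdims=True)            # competition across heads

    return (gate[:, :, None] * heads).reshape(T, -1)     # gated heads, concatenated
```

Note that the softmax here is over the H heads per token, not over the T tokens, so the overall cost stays linear in sequence length.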
Problem

Research questions and friction points this paper is trying to address.

linear attention
softmax normalization
global competition
long-context understanding
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Softmax Linear Attention
global competition
linear attention
multi-head gating
long-context modeling