Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models

📅 2026-02-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the quadratic computational complexity of softmax-based Transformers in long-context scenarios and the limited expressivity of existing linear attention models. To reconcile efficiency and representational power, the authors propose NAtS-L, a framework that dynamically selects, at the token level, between linear attention (based on Gated DeltaNet) and softmax attention. This approach enables fine-grained balancing of computation and expressiveness, moving beyond conventional architectures that rely on fixed, layer-wise hybrid attention mechanisms. By introducing an adaptive path-selection mechanism, NAtS-L significantly reduces computational overhead while preserving model performance, demonstrating the effectiveness and efficiency of token-level hybrid attention architectures.

๐Ÿ“ Abstract
The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards more efficient sequence models. These linear attention models compress past key-value (KV) pairs into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or requires softmax attention, i.e., tokens that contain information related to long-term retrieval and need to be preserved for future queries. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
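To make the token-level routing concrete, here is a minimal NumPy sketch of the idea: tokens flagged for softmax attention keep their keys and values in a growing KV cache, while all other tokens are folded into a fixed-size recurrent state. The function name, the boolean routing mask, and the simple decaying outer-product update are illustrative assumptions; the actual paper uses Gated DeltaNet as the linear path and learns the per-token selection, neither of which is reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_level_hybrid_attention(Q, K, V, use_softmax, beta=0.9):
    """Toy per-token hybrid attention (a sketch, not the paper's implementation).

    Tokens with use_softmax[t] == True store their K/V in a growing cache and
    are attended to with softmax attention; the remaining tokens are compressed
    into a fixed-size linear-attention state S (a crude stand-in for the paper's
    Gated DeltaNet update).
    """
    T, d = Q.shape
    S = np.zeros((d, d))          # fixed-size linear-attention state
    cache_k, cache_v = [], []     # softmax KV cache holds only selected tokens
    out = np.zeros((T, d))
    for t in range(T):
        if use_softmax[t]:
            cache_k.append(K[t])
            cache_v.append(V[t])
        else:
            # decaying outer-product state update (illustrative gating)
            S = beta * S + np.outer(K[t], V[t])
        lin = Q[t] @ S            # read from the compressed linear state
        if cache_k:
            scores = softmax(np.stack(cache_k) @ Q[t] / np.sqrt(d))
            out[t] = lin + scores @ np.stack(cache_v)
        else:
            out[t] = lin
    return out

# demo: route a few tokens to softmax attention, the rest to the linear state
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 4))
K = rng.standard_normal((8, 4))
V = rng.standard_normal((8, 4))
mask = np.array([1, 0, 0, 1, 0, 0, 0, 1], dtype=bool)
print(token_level_hybrid_attention(Q, K, V, mask).shape)  # (8, 4)
```

The efficiency argument is visible in the cache: only the three masked tokens ever enter the softmax KV cache, so softmax cost grows with the number of "long-term" tokens rather than with the full sequence length, while the rest of the sequence costs O(1) memory via the state S.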
Problem

Research questions and friction points this paper is trying to address.

linear attention
softmax attention
computational complexity
long-context
hybrid attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

linear attention
token-level hybrid attention
adaptive attention mechanism
Neural Attention Search Linear
efficient transformers
Difan Deng
Leibniz Universität Hannover
AutoML
Andreas Bentzen Winje
Institute of Artificial Intelligence, Leibniz University Hannover, Hannover, Germany
Lukas Fehring
Institute of Artificial Intelligence, Leibniz University Hannover, Hannover, Germany
Marius Lindauer
Leibniz University Hannover (Germany), Institute of Artificial Intelligence LUH|AI, L3S Research
Machine Learning, AutoML, Reinforcement Learning, Interpretable Machine Learning, Artificial Intelligence