AI Summary
This work addresses the quadratic computational complexity of softmax-based Transformers in long-context scenarios and the limited expressivity of existing linear attention models. To reconcile efficiency and representational power, the authors propose NAtS-L, a framework that dynamically selects, at the token level, between linear attention (based on Gated DeltaNet) and softmax attention. This approach enables fine-grained balancing of computation and expressiveness, moving beyond conventional architectures that rely on fixed, layer-wise hybrid attention mechanisms. By introducing an adaptive path-selection mechanism, NAtS-L substantially reduces computational overhead while preserving model performance, demonstrating the effectiveness and efficiency of token-level mixed attention architectures.
Abstract
The quadratic computational complexity of softmax Transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards more efficient sequence models. These linear attention models compress past key-value (KV) pairs into a single fixed-size hidden state, thereby reducing complexity during both training and inference. However, their expressivity remains limited by the size of that hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer, on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or requires softmax attention, i.e., tokens that contain information relevant to long-term retrieval and need to be preserved for future queries. By searching for optimal combinations of Gated DeltaNet and softmax attention across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
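The token-level routing described above can be illustrated with a toy sketch. This is not the paper's actual NAtS-L algorithm: the routing decisions are given as an input mask rather than learned, and the linear path uses a plain outer-product accumulator standing in for Gated DeltaNet's gated delta-rule state update. It only shows the mechanics of keeping "softmax tokens" in a KV cache while folding "linear tokens" into a fixed-size state.

```python
import numpy as np

def token_level_hybrid_attention(q, k, v, use_softmax):
    """Toy token-level hybrid attention (illustrative, not NAtS-L itself).

    Tokens flagged in `use_softmax` are kept in a KV cache and attended to
    with causal softmax attention; all other tokens are compressed into a
    fixed-size linear-attention state S. Each query reads from both paths.
    """
    T, d = q.shape
    S = np.zeros((d, d))          # fixed-size linear-attention state
    cache_k, cache_v = [], []     # KV cache holds only the softmax tokens
    out = np.zeros_like(v)
    for t in range(T):
        # Read phase: query the linear state and the softmax KV cache.
        o_lin = q[t] @ S
        if cache_k:
            K, V = np.stack(cache_k), np.stack(cache_v)
            scores = K @ q[t] / np.sqrt(d)
            w = np.exp(scores - scores.max())
            w /= w.sum()
            o_sm = w @ V
        else:
            o_sm = np.zeros(d)
        out[t] = o_lin + o_sm
        # Write phase: route the current token to exactly one path.
        if use_softmax[t]:
            cache_k.append(k[t])  # preserved for future long-term retrieval
            cache_v.append(v[t])
        else:
            S += np.outer(k[t], v[t])  # short-term token folded into state
    return out
```

The KV cache grows only with the number of softmax-routed tokens, so the per-token routing directly trades memory and compute for expressivity; the real model additionally learns which tokens to route where.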