🤖 AI Summary
To address the trade-off between the quadratic complexity of soft attention and the accuracy degradation of existing linear attention methods—caused by fixed, non-adaptive random feature mappings—this paper proposes LUNA, a linear attention mechanism with learnable kernel feature mappings. Its core innovation lies in parameterizing the feature mapping of a kernel function and optimizing it end-to-end, enabling the first adaptive learning of feature mappings within kernelized linear attention. LUNA preserves O(n) time and space complexity while substantially enhancing modeling capacity. Theoretically, it guarantees positive-definiteness of the induced kernel and ensures generalization bounds, supporting efficient streaming inference. Experiments demonstrate that LUNA achieves state-of-the-art average accuracy on Long Range Arena and significantly outperforms fixed-mapping baselines when applied as a drop-in replacement in post-hoc BERT and ViT attention layers—nearly fully recovering the performance of the original models.
📝 Abstract
Scaling attention faces a critical bottleneck: the $mathcal{O}(n^2)$ quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to $mathcal{O}(n)$, they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce extsc{LUNA}, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. extsc{LUNA} is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, extsc{LUNA} learns a feature basis tailored to the specific data and task, overcoming the expressive limitations of fixed-feature methods. extsc{Luna} implements this with a learnable feature map that induces a positive-definite kernel and admits a streaming form, yielding linear time and memory scaling in the sequence length. Empirical evaluations validate our approach across diverse settings. On the Long Range Arena (LRA), extsc{Luna} achieves state-of-the-art average accuracy among efficient Transformers under compute parity, using the same parameter count, training steps, and approximate FLOPs. extsc{Luna} also excels at post-hoc conversion: replacing softmax in fine-tuned BERT and ViT-B/16 checkpoints and briefly fine-tuning recovers most of the original performance, substantially outperforming fixed linearizations.