LUNA: Linear Universal Neural Attention with Generalization Guarantees

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Softmax attention has quadratic complexity, while existing linear attention methods degrade accuracy because they rely on fixed, non-adaptive random feature mappings. To address this trade-off, this paper proposes LUNA, a linear attention mechanism with learnable kernel feature mappings. Its core innovation is to parameterize the feature mapping of a kernel function and optimize it end to end, enabling, for the first time, adaptive learning of feature mappings within kernelized linear attention. LUNA preserves O(n) time and space complexity while substantially enhancing modeling capacity. Theoretically, the induced kernel is guaranteed to be positive definite, the method admits generalization bounds, and it supports efficient streaming inference. Experiments show that LUNA achieves state-of-the-art average accuracy on Long Range Arena and, when used as a post-hoc drop-in replacement for the attention layers of fine-tuned BERT and ViT models, significantly outperforms fixed-mapping baselines, nearly fully recovering the performance of the original models.
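The mechanism summarized above can be sketched in a few lines. Note the feature map used here (a learnable projection followed by an elu-style positive nonlinearity) is an illustrative assumption, not the paper's exact parameterization; the point is that contracting keys with values first yields linear rather than quadratic cost:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 32                                  # head dim, feature dim
W = rng.normal(scale=d ** -0.5, size=(d, r))   # learnable in practice

def phi(x):
    # Positive features (elu(z) + 1), so the induced kernel phi(q)·phi(k)
    # is non-negative and the normalizer below stays positive.
    z = x @ W
    return np.where(z > 0, z + 1.0, np.exp(z))

def linear_attention(Q, K, V):
    # O(n · r · d) instead of O(n² · d): contract keys with values first.
    Qf, Kf = phi(Q), phi(K)          # (n, r)
    KV = Kf.T @ V                    # (r, d_v), shared across all queries
    Z = Kf.sum(axis=0)               # (r,) normalizer
    return (Qf @ KV) / (Qf @ Z)[:, None]

n = 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
V = rng.normal(size=(n, 4))
out = linear_attention(Q, K, V)      # shape (8, 4)
```

In LUNA the projection defining the feature map is trained jointly with the rest of the network, which is what distinguishes it from fixed random-feature linearizations.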

📝 Abstract
Scaling attention faces a critical bottleneck: the $\mathcal{O}(n^2)$ quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to $\mathcal{O}(n)$, they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce LUNA, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. LUNA is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, LUNA learns a feature basis tailored to the specific data and task, overcoming the expressive limitations of fixed-feature methods. LUNA implements this with a learnable feature map that induces a positive-definite kernel and admits a streaming form, yielding linear time and memory scaling in the sequence length. Empirical evaluations validate our approach across diverse settings. On the Long Range Arena (LRA), LUNA achieves state-of-the-art average accuracy among efficient Transformers under compute parity, using the same parameter count, training steps, and approximate FLOPs. LUNA also excels at post-hoc conversion: replacing softmax in fine-tuned BERT and ViT-B/16 checkpoints and briefly fine-tuning recovers most of the original performance, substantially outperforming fixed linearizations.
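The streaming form mentioned in the abstract follows from the same factorization: the only state needed for causal inference is a running sum of outer products between key features and values, plus a running normalizer, both of fixed size. A minimal sketch, again assuming a hypothetical elu-style positive feature map:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, dv = 16, 32, 4
W = rng.normal(scale=d ** -0.5, size=(d, r))   # learnable in practice

def phi(x):
    z = x @ W
    return np.where(z > 0, z + 1.0, np.exp(z))

# Causal streaming inference: the state is a fixed-size (r, dv) matrix plus
# an (r,) normalizer, independent of how many tokens have been processed.
S = np.zeros((r, dv))
z = np.zeros(r)
outputs = []
for t in range(10):
    q, k = rng.normal(size=d), rng.normal(size=d)
    v = rng.normal(size=dv)
    fk = phi(k)
    S += np.outer(fk, v)                 # accumulate phi(k_t) v_t^T
    z += fk
    fq = phi(q)
    outputs.append((fq @ S) / (fq @ z))  # O(r · dv) work per token
```

This constant-memory recurrence is what makes linear attention attractive for autoregressive decoding, where softmax attention would require caching all past keys and values.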
Problem

Research questions and friction points this paper is trying to address.

Quadratic computational cost of softmax attention limits long-sequence applications
Fixed random feature maps in linear attention sacrifice accuracy for efficiency
Static data-agnostic kernels create fundamental trade-off between performance and speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns kernel feature maps for linear attention
Achieves linear computational cost with high accuracy
Enables post-hoc conversion of existing Transformer models
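The post-hoc conversion listed above works because kernelized linear attention has the same (Q, K, V) → output signature and output shape as softmax attention, so it can replace the attention operation inside an existing checkpoint before brief fine-tuning. A hedged sketch, with the feature map once more an illustrative assumption:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard O(n²) scaled dot-product attention.
    A = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(2)
d, r = 16, 64
W = rng.normal(scale=d ** -0.5, size=(d, r))   # learned during fine-tuning

def phi(x):
    z = x @ W
    return np.where(z > 0, z + 1.0, np.exp(z))

def luna_attention(Q, K, V):
    # Same signature and output shape as softmax_attention: a drop-in swap.
    Qf, Kf = phi(Q), phi(K)
    return (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(axis=0))[:, None]

n = 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out_soft = softmax_attention(Q, K, V)
out_lin = luna_attention(Q, K, V)      # identical shape, linear cost
```

Because only the attention operation changes, the rest of the pretrained weights are reused as-is, which is why a short fine-tuning run can recover most of the original model's performance.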