Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing attention mechanisms struggle to balance expressiveness with computational efficiency: Softmax Attention is expressive but quadratic in sequence length, while linear attention is cheap but less expressive. To address this, the authors propose Local Linear Attention (LLA), an attention formulation derived from a nonparametric-statistics view of test-time regression, and show via a bias–variance trade-off analysis that it offers theoretical advantages over both. On the systems side, they introduce two memory-efficient primitives and FlashLLA, a hardware-efficient blockwise algorithm for scalable parallel computation on modern accelerators, together with a customized inference kernel that significantly reduces memory overhead. Experiments on test-time regression, in-context regression, associative recall, and state tracking show that LLA adapts well to non-stationarity, outperforms strong baselines in test-time training and in-context learning, and gives promising evidence of scalability to large models.
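The test-time-regression view can be illustrated with classical nonparametric estimators: softmax attention corresponds to a local-constant (Nadaraya–Watson) estimate, a kernel-weighted average of values, while a local-linear estimate additionally fits a slope around each query and returns the intercept. The NumPy sketch below shows that contrast under this interpretation only; the `ridge` term, the exponential kernel, and the per-query solve are illustrative simplifications, not the paper's LLA or FlashLLA implementation.

```python
import numpy as np

def softmax_attention(q, K, V):
    # Local-constant (Nadaraya-Watson) estimate: kernel-weighted average of values.
    s = K @ q
    w = np.exp(s - s.max())          # shifted for numerical stability
    w /= w.sum()
    return w @ V

def local_linear_attention(q, K, V, ridge=1e-4):
    # Local-linear estimate: fit v ~ b0 + (k - q) @ B by kernel-weighted least
    # squares and return the intercept b0, i.e. the fitted prediction at k = q.
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    X = np.hstack([np.ones((K.shape[0], 1)), K - q])   # design matrix, (n, d+1)
    WX = w[:, None] * X
    G = X.T @ WX + ridge * np.eye(X.shape[1])          # (d+1, d+1) normal equations
    beta = np.linalg.solve(G, WX.T @ V)                # (d+1, d_v) coefficients
    return beta[0]                                     # intercept row
```

When the values are an exact affine function of the keys, the local-linear estimate recovers the query's value exactly (up to the ridge term), whereas the kernel-weighted average generally does not; this is the bias reduction that motivates the local-linear form.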

📝 Abstract
Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at greater computational cost, has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $\Theta(n^2 d)$ and $\Theta(n d^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experiment results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models. Code is available at https://github.com/Yifei-Zuo/Flash-LLA.
Problem

Research questions and friction points this paper is trying to address.

Softmax Attention is expressive but costly, linear attention is efficient but less expressive, and principled interpolations between them are underexplored
A direct local linear formulation incurs $\Theta(n^2 d)$ and $\Theta(n d^2)$ compute and memory costs that block naive implementation
Existing attention mechanisms adapt poorly to non-stationarity in test-time and in-context regression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local Linear Attention (LLA), derived from nonparametric statistics through the lens of test-time regression, with a bias–variance trade-off analysis
FlashLLA, a hardware-efficient blockwise algorithm for scalable parallel computation on modern accelerators
Customized, profiled inference kernel that significantly reduces memory overhead
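The paper's FlashLLA algorithm is not reproduced here, but the blockwise principle behind Flash-style attention kernels can be sketched generically: stream over key/value blocks while maintaining a running max, normalizer, and unnormalized accumulator, so the full $n \times n$ score matrix is never materialized. A minimal single-query NumPy sketch of that online-softmax idea (function and parameter names are hypothetical, and real kernels operate on tiles in fast on-chip memory):

```python
import numpy as np

def blockwise_softmax_attention(q, K, V, block=4):
    # Online softmax over key/value blocks: rescale the running statistics
    # whenever a new block raises the running maximum.
    m = -np.inf                     # running max of scores (numerical stability)
    z = 0.0                         # running softmax normalizer
    acc = np.zeros(V.shape[1])      # running unnormalized output
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        scores = Kb @ q
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)   # rescales previous partial sums
        p = np.exp(scores - m_new)
        z = z * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / z
```

The result is identical to standard softmax attention for any block size; only the memory traffic changes, which is what makes blockwise computation attractive on accelerators.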