Sparse Attention as Compact Kernel Regression

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of theoretical grounding in existing sparse attention mechanisms, whose sparsity patterns and design principles remain largely heuristic. By establishing a formal correspondence between sparse attention and kernel regression with compact (bounded) support, the study shows that α-entmax attention with α = 1 + 1/n (for integer n ≥ 1) is equivalent to a Nadaraya–Watson estimator built on classical bounded-support kernels such as the Epanechnikov and biweight kernels, with the softmax/Gaussian pairing recovered in the limit n → ∞. Sparsity thus arises inherently from the bounded support of the kernel, offering a principled alternative to ad hoc strategies like top-k selection. Leveraging this insight, the authors develop a tunable sparse attention mechanism integrated into the Memory Mosaics architecture and show competitive performance on language modeling, in-context learning, and length generalization tasks, supporting the generality of the proposed framework.
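
To make the stated equivalence concrete, here is the correspondence in formulas (a sketch in my own notation; the bandwidth rescaling used in the limit is the standard kernel-smoothing convention, not a detail taken from the paper):

```latex
% Nadaraya--Watson regression: query q, keys k_i, values v_i, bandwidth h.
\hat{v}(q) \;=\; \frac{\sum_{i} K\!\big((q - k_i)/h\big)\, v_i}
                      {\sum_{j} K\!\big((q - k_j)/h\big)}

% Compact-support family behind alpha-entmax with alpha = 1 + 1/n:
% n = 1: Epanechnikov, n = 2: biweight, n = 3: triweight.
K_n(u) \;\propto\; \big(1 - \|u\|^2\big)_+^{\,n}

% Softmax/Gaussian emerges in the limit n -> infinity (rescaled bandwidth):
\lim_{n \to \infty} \Big(1 - \tfrac{\|u\|^2}{2n}\Big)_+^{\,n} \;=\; e^{-\|u\|^2 / 2}
```

Keys falling outside the kernel's support receive weight exactly zero, which is where the sparsity comes from; the Gaussian kernel has unbounded support, so softmax weights are never exactly zero.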

📝 Abstract
Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
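
A minimal runnable sketch of the sparse/dense contrast described above (my own illustrative code, not the paper's implementation; the function names, the bandwidth parameter `h`, and the fixed normalization are assumptions on my part):

```python
import numpy as np

def epanechnikov_attention(q, K, V, h=1.0):
    """Nadaraya-Watson estimate with the Epanechnikov (compact) kernel.

    Keys farther than bandwidth h from the query get weight exactly 0,
    so the attention pattern is sparse by construction.
    q: (d,) query; K: (n, d) keys; V: (n, d_v) values.
    """
    d2 = np.sum((K - q) ** 2, axis=1) / h ** 2  # squared scaled distances
    w = np.maximum(0.0, 1.0 - d2)               # zero outside the support
    if w.sum() == 0.0:                          # no key inside the support
        return np.zeros(V.shape[1]), w
    w = w / w.sum()                             # normalized-ReLU weights
    return w @ V, w

def gaussian_attention(q, K, V, h=1.0):
    """Gaussian-kernel counterpart: standard softmax, never exactly zero."""
    d2 = np.sum((K - q) ** 2, axis=1) / h ** 2
    w = np.exp(-0.5 * d2)
    w = w / w.sum()
    return w @ V, w

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))
values = rng.normal(size=(8, 3))
query = keys[0] + 0.1 * rng.normal(size=4)

_, w_sparse = epanechnikov_attention(query, keys, values, h=1.5)
_, w_dense = gaussian_attention(query, keys, values, h=1.5)
print("Epanechnikov weights:", np.round(w_sparse, 3))   # many exact zeros
print("Gaussian/softmax weights:", np.round(w_dense, 3))  # all positive
```

With the fixed normalization above, the Epanechnikov weights correspond to normalized ReLU attention; per the abstract, the paper's adaptive normalization yields sparsemax instead.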
Problem

Research questions and friction points this paper is trying to address.

sparse attention
kernel regression
compact kernels
transformers
Nadaraya-Watson estimator
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse attention
kernel regression
compact kernel
α-entmax
Nadaraya-Watson estimator
Saul Santos
Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal; Instituto de Telecomunicações, Lisbon, Portugal
Nuno Gonçalves
Institute for Systems and Robotics, University of Coimbra
Daniel C. McNamee
Champalimaud Research, Lisbon, Portugal
André F. T. Martins
Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal; Instituto de Telecomunicações, Lisbon, Portugal; TransPerfect, Lisbon, Portugal