Sparse Attention as Compact Kernel Regression

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of theoretical grounding in existing sparse attention mechanisms, whose sparsity patterns and design principles remain largely heuristic. By establishing a formal correspondence between sparse attention and kernel regression with compact (bounded) support, the study shows that α-entmax attention with α = 1 + 1/n (for integer n ≥ 1) is equivalent to a Nadaraya–Watson estimator built on classical bounded-support kernels such as the Epanechnikov and biweight kernels, with the softmax/Gaussian pairing recovered in the limit n → ∞. Sparsity thus arises inherently from the bounded support of the kernel, offering a principled alternative to ad hoc strategies like top-k selection. Leveraging this insight, the authors develop a tunable sparse attention mechanism integrated into the Memory Mosaics architecture and show competitive performance on language modeling, in-context learning, and length generalization tasks, supporting the generality of the proposed framework.
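
To make the stated equivalence concrete, here is the correspondence in formulas (a sketch in my own notation; the bandwidth rescaling used in the limit is the standard kernel-smoothing convention, not a detail taken from the paper):

```latex
% Nadaraya--Watson regression: query q, keys k_i, values v_i, bandwidth h.
\hat{v}(q) \;=\; \frac{\sum_{i} K\!\big((q - k_i)/h\big)\, v_i}
                      {\sum_{j} K\!\big((q - k_j)/h\big)}

% Compact-support family behind alpha-entmax with alpha = 1 + 1/n:
% n = 1: Epanechnikov, n = 2: biweight, n = 3: triweight.
K_n(u) \;\propto\; \big(1 - \|u\|^2\big)_+^{\,n}

% Softmax/Gaussian emerges in the limit n -> infinity (rescaled bandwidth):
\lim_{n \to \infty} \Big(1 - \tfrac{\|u\|^2}{2n}\Big)_+^{\,n} \;=\; e^{-\|u\|^2 / 2}
```

Keys falling outside the kernel's support receive weight exactly zero, which is where the sparsity comes from; the Gaussian kernel has unbounded support, so softmax weights are never exactly zero.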

📝 Abstract
Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
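
A minimal runnable sketch of the sparse/dense contrast described above (my own illustrative code, not the paper's implementation; the function names, the bandwidth parameter `h`, and the fixed normalization are assumptions on my part):

```python
import numpy as np

def epanechnikov_attention(q, K, V, h=1.0):
    """Nadaraya-Watson estimate with the Epanechnikov (compact) kernel.

    Keys farther than bandwidth h from the query get weight exactly 0,
    so the attention pattern is sparse by construction.
    q: (d,) query; K: (n, d) keys; V: (n, d_v) values.
    """
    d2 = np.sum((K - q) ** 2, axis=1) / h ** 2  # squared scaled distances
    w = np.maximum(0.0, 1.0 - d2)               # zero outside the support
    if w.sum() == 0.0:                          # no key inside the support
        return np.zeros(V.shape[1]), w
    w = w / w.sum()                             # normalized-ReLU weights
    return w @ V, w

def gaussian_attention(q, K, V, h=1.0):
    """Gaussian-kernel counterpart: standard softmax, never exactly zero."""
    d2 = np.sum((K - q) ** 2, axis=1) / h ** 2
    w = np.exp(-0.5 * d2)
    w = w / w.sum()
    return w @ V, w

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))
values = rng.normal(size=(8, 3))
query = keys[0] + 0.1 * rng.normal(size=4)

_, w_sparse = epanechnikov_attention(query, keys, values, h=1.5)
_, w_dense = gaussian_attention(query, keys, values, h=1.5)
print("Epanechnikov weights:", np.round(w_sparse, 3))   # many exact zeros
print("Gaussian/softmax weights:", np.round(w_dense, 3))  # all positive
```

With the fixed normalization above, the Epanechnikov weights correspond to normalized ReLU attention; per the abstract, the paper's adaptive normalization yields sparsemax instead.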
Problem

Research questions and friction points this paper is trying to address.

sparse attention
kernel regression
compact kernels
transformers
Nadaraya-Watson estimator
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse attention
kernel regression
compact kernel
α-entmax
Nadaraya-Watson estimator
Saul Santos
Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal; Instituto de Telecomunicações, Lisbon, Portugal
Nuno Gonçalves
Institute for Systems and Robotics, University of Coimbra
Daniel C. McNamee
Champalimaud Research, Lisbon, Portugal
André F. T. Martins
Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal; Instituto de Telecomunicações, Lisbon, Portugal; TransPerfect, Lisbon, Portugal