Projection-Free Transformers via Gaussian Kernel Attention

📅 2026-05-03

🤖 AI Summary

This work proposes Gaussian Kernel Attention (GKA), an attention mechanism that replaces the learned query, key, and value projections in Transformers with a Gaussian radial basis function (RBF) kernel that measures token similarity directly on per-head token features. Each head learns only a bandwidth parameter, and a single output projection keeps the layer compatible with the standard Transformer interface, yielding a compact, interpretable architecture with an explicit locality scale. GKA can be read as normalized kernel regression over tokens, which links attention to classical non-local filtering and kernel smoothing. In autoregressive language modeling at depth 20, a GKA model with 0.42× the parameters and 0.49× the total training FLOPs of a standard attention baseline trains stably, shows a near-zero train-validation gap, and behaves competitively on standard benchmarks, though with higher bits-per-byte (BPB) at this compute scale.
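
As a rough illustration of the kernel-regression reading, one head's output for token $i$ could be written as below. The notation is assumed for this sketch rather than taken from the paper: $x_i^{(h)}$ is the slice of token features assigned to head $h$, $\mathcal{N}(i)$ is the set of tokens visible to token $i$ (for example, the causal window), and the factor of 2 in the exponent is one common RBF convention that the paper may or may not use.

$$
y_i^{(h)} = \frac{\sum_{j \in \mathcal{N}(i)} k_h\big(x_i^{(h)}, x_j^{(h)}\big)\, x_j^{(h)}}{\sum_{j \in \mathcal{N}(i)} k_h\big(x_i^{(h)}, x_j^{(h)}\big)},
\qquad
k_h(u, v) = \exp\!\left(-\frac{\lVert u - v \rVert^2}{2\sigma_h^2}\right)
$$

Under this form, the only learned quantity inside the head is $\sigma_h$, which sets how quickly affinity decays with feature distance.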

📝 Abstract

Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce Gaussian Kernel Attention (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter $\sigma_h$, while a single output projection $W_O$ preserves compatibility with the standard Transformer interface. GKA can be interpreted as normalized kernel regression over tokens, linking modern Transformer architectures to classical non-local filtering and kernel smoothing methods. We evaluate GKA in both vision and language modeling settings. For autoregressive language modeling within the nanochat framework, we implement causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel. At depth 20, a GKA model with $0.42\times$ the parameters and $0.49\times$ the total training FLOPs of a standard attention baseline trains stably, exhibits a near-zero train-validation gap, and demonstrates competitive behavior on standard benchmarks, albeit with higher bits-per-byte (BPB) at this compute scale. Overall, GKA provides a minimal, interpretable attention mechanism with an explicit locality scale, offering an additional dimension in the accuracy-efficiency trade-off for Transformer design.
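
The mechanism described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gaussian_kernel_attention`, the head split by contiguous feature slices, the $2\sigma_h^2$ scaling in the exponent, and the `window` argument (number of visible tokens, including the current one) are all hypothetical choices made for this example.

```python
import math


def gaussian_kernel_attention(x, sigmas, w_o, window=None):
    """Causal Gaussian kernel attention for one sequence (illustrative sketch).

    x:      list of T token vectors, each of length D
    sigmas: one positive bandwidth per head (H = len(sigmas), D divisible by H)
    w_o:    D x D output projection, given as a list of rows
    window: optional number of visible tokens per position, including itself
    Returns (output, weights) with weights[h][i][j] the normalized affinity.
    """
    T, D, H = len(x), len(x[0]), len(sigmas)
    assert D % H == 0, "feature dimension must split evenly across heads"
    d = D // H
    mixed = [[0.0] * D for _ in range(T)]
    weights = [[[0.0] * T for _ in range(T)] for _ in range(H)]

    for h, sigma in enumerate(sigmas):
        lo, hi = h * d, (h + 1) * d  # this head's slice of the raw features
        for i in range(T):
            # Causal mask: only tokens j <= i are visible; the optional
            # sliding window further limits how far back token i can look.
            start = 0 if window is None else max(0, i - window + 1)
            # Gaussian RBF affinity between token i and each visible token j.
            k = [
                math.exp(
                    -sum((x[i][c] - x[j][c]) ** 2 for c in range(lo, hi))
                    / (2.0 * sigma ** 2)
                )
                for j in range(start, i + 1)
            ]
            # Renormalize over the visible tokens only. The sum is at least 1
            # because the self-affinity is exp(0) = 1.
            z = sum(k)
            for offset, kij in enumerate(k):
                j = start + offset
                w = kij / z
                weights[h][i][j] = w
                # Kernel-weighted average of the raw features (no W_V).
                for c in range(lo, hi):
                    mixed[i][c] += w * x[j][c]

    # Single output projection W_O applied to the concatenated heads.
    out = [
        [sum(mixed[i][c] * w_o[c][o] for c in range(D)) for o in range(D)]
        for i in range(T)
    ]
    return out, weights
```

In this sketch a small $\sigma_h$ concentrates each row of weights on the token itself and its nearest neighbors in feature space, while a large $\sigma_h$ approaches a uniform average over the visible window, which is one way to read the "explicit locality scale" mentioned above.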

Problem

Research questions and friction points this paper is trying to address.

Projection-Free

Self-Attention

Transformer

Gaussian Kernel

Learned Projections

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian Kernel Attention

Projection-Free

RBF Kernel