🤖 AI Summary
It has remained unclear whether a standard softmax-attention Transformer can perform nonlinear kernel regression with convergence guarantees. This work constructs a single-head Transformer whose forward pass implements preconditioned Richardson iteration on the kernel linear system, thereby approximating the solution of Gaussian kernel ridge regression (KRR). It establishes theoretically that standard softmax attention can implement an end-to-end nonlinear kernel regressor with controllable prediction error, and it reveals a functional decomposition: attention handles cross-token interactions, while the ReLU MLP performs local, intra-token scalar arithmetic. Under bounded-data assumptions, the constructed model achieves $\varepsilon$-accurate prediction for prompts of length $N$ using only $O(\log(1/\varepsilon))$ blocks and an MLP width of $O(\sqrt{N/\varepsilon})$. Experiments on trained GPT-2-style transformers show that their layer-wise error profiles align most consistently with preconditioned Richardson iteration.
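As a rough sketch of where the logarithmic depth can come from (an illustration under stated assumptions, not the paper's own derivation): write the KRR system as $A\alpha = y$ with $A = K + \lambda I$, and suppose each block applies one exact step $\alpha_{t+1} = \alpha_t + \eta P\,(y - A\alpha_t)$. The preconditioner $P$, step size $\eta$, ridge parameter $\lambda$, and contraction factor $\rho$ are illustrative symbols that do not appear in the source. The error then satisfies $\alpha_{t+1} - \alpha^\star = (I - \eta P A)(\alpha_t - \alpha^\star)$, so if $\|I - \eta P A\| \le \rho < 1$ in some norm,

$$
\|\alpha_T - \alpha^\star\| \;\le\; \rho^{T}\,\|\alpha_0 - \alpha^\star\| \;\le\; \varepsilon\,\|\alpha_0 - \alpha^\star\|
\quad\text{whenever}\quad
T \;\ge\; \frac{\log(1/\varepsilon)}{\log(1/\rho)} .
$$

When $\rho$ is bounded away from 1, this gives $T = O(\log(1/\varepsilon))$ steps. The paper's actual constants, and its treatment of the additional error from approximating the update with ReLU MLPs, are not reproduced here.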
📝 Abstract
Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/\varepsilon))$ blocks and MLP width $O(\sqrt{N/\varepsilon})$ that achieves $\varepsilon$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer's layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.
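To make the classical solver concrete, here is a minimal, dependency-free Python sketch of preconditioned Richardson iteration for Gaussian KRR. It is not the paper's transformer construction. It assumes the system $(K + \lambda I)\alpha = y$ and, as an illustrative stand-in for the row normalization that softmax attention provides, a diagonal preconditioner given by the inverse row sums of $K + \lambda I$; the paper's exact preconditioner may differ. The function names `gaussian_kernel`, `krr_richardson`, and `krr_predict`, and the parameters `lam`, `sigma`, `eta`, and `steps`, are hypothetical.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def krr_richardson(X, y, lam=1.0, sigma=1.0, steps=50, eta=1.0):
    """Approximately solve (K + lam*I) alpha = y by preconditioned Richardson.

    Assumption: the preconditioner is the inverse row sum of K + lam*I,
    i.e. a softmax-like row normalization chosen for illustration only.
    Returns the coefficients and the residual norm seen before each update.
    """
    n = len(X)
    # System matrix A = K + lam * I, built entry by entry.
    A = [[gaussian_kernel(X[i], X[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    row_sums = [sum(row) for row in A]  # diagonal of the preconditioner's inverse
    alpha = [0.0] * n
    residual_norms = []
    for _ in range(steps):
        # Residual r = y - A @ alpha for the current iterate.
        residual = [y[i] - sum(A[i][j] * alpha[j] for j in range(n))
                    for i in range(n)]
        residual_norms.append(math.sqrt(sum(r * r for r in residual)))
        # Preconditioned update: alpha <- alpha + eta * D^{-1} r.
        alpha = [alpha[i] + eta * residual[i] / row_sums[i] for i in range(n)]
    return alpha, residual_norms


def krr_predict(X, alpha, x_query, sigma=1.0):
    """KRR prediction f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a * gaussian_kernel(x, x_query, sigma) for x, a in zip(X, alpha))
```

With this choice the row-normalized matrix has nonnegative entries and unit row sums, and it is similar to a symmetric positive definite matrix, so its eigenvalues lie in $(0, 1]$ and the iteration with `eta=1.0` converges. The rate can be slow when `lam` is small relative to the prompt length.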