🤖 AI Summary
This paper investigates the statistical convergence rate of single-layer attention-style models for learning pairwise interactions. Addressing limitations in existing theory, where rates depend on token count, input dimensionality, or the rank of the weight matrix, we propose a dimension-free minimax analysis framework: under only a $\beta$-Hölder continuity assumption on the activation function, we establish the optimal convergence rate $M^{-2\beta/(2\beta+1)}$, where $M$ is the sample size. Methodologically, we integrate nonparametric regression theory with attention's nonlocal interaction modeling, providing the first rigorous characterization, under nonlinear activations, of the statistical efficiency jointly induced by weight matrices and feature interactions. Our key contribution is breaking the curse of dimensionality, revealing an intrinsic statistical advantage of attention models in learning higher-order interactions; this yields the first tight theoretical benchmark for understanding their empirical efficacy.
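For concreteness, here is a minimal sketch of the kind of model under study: tokens interacting pairwise through a weight matrix and a nonlinear activation. The exact functional form below (bilinear scores $x_i^\top W x_j$ passed through an activation and averaged) is our illustrative assumption, not the paper's precise definition.

```python
import numpy as np

def pairwise_attention_score(X, W, activation=np.tanh):
    """Sketch of a single-layer attention-style pairwise interaction model.

    X : (n_tokens, d) token embeddings
    W : (d, d) interaction weight matrix (possibly low-rank)
    The bilinear pair scores x_i^T W x_j are passed through a nonlinear
    activation and averaged; this form is an assumed illustration.
    """
    scores = X @ W @ X.T               # (n_tokens, n_tokens) pairwise scores
    return activation(scores).mean()   # aggregate activated interactions

# Example: 5 tokens in ambient dimension 8, with a rank-2 weight matrix;
# per the result, the rate depends on none of these three quantities.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
U, V = rng.standard_normal((8, 2)), rng.standard_normal((2, 8))
W = U @ V
print(pairwise_attention_score(X, W))
```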
📝 Abstract
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size; the rate depends only on the smoothness $\beta$ of the activation and is crucially independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and the activation are not separately identifiable, and they provide a theoretical understanding of the attention mechanism and its training.
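To make the dimension-free claim concrete, compare against the classical minimax rate for $\beta$-Hölder regression with $d$-dimensional inputs (Stone, 1982), which degrades with $d$; only the second display below is what the abstract asserts for attention-style models.

```latex
\[
  \underbrace{M^{-\frac{2\beta}{2\beta+d}}}_{\text{generic nonparametric regression in } d \text{ dimensions}}
  \qquad\text{vs.}\qquad
  \underbrace{M^{-\frac{2\beta}{2\beta+1}}}_{\text{attention-style pairwise model}}
\]
```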