Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates the statistical convergence rate of single-layer attention-style models for learning pairwise interactions. Addressing limitations in existing theory—where rates depend on label count, input dimensionality, or weight matrix rank—we propose a dimension-free minimax analysis framework: under only a $\beta$-Hölder continuity assumption on the activation function, we establish the optimal convergence rate $M^{-2\beta/(2\beta+1)}$, where $M$ is the sample size. Methodologically, we integrate nonparametric regression theory with attention's nonlocal interaction modeling, providing the first rigorous characterization—under nonlinear activations—of the statistical efficiency jointly induced by weight matrices and feature interactions. Our key contribution is breaking the curse of dimensionality, thereby revealing an intrinsic statistical advantage of attention models in learning higher-order interactions; this yields the first tight theoretical benchmark for understanding their empirical efficacy.

📝 Abstract
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, with $M$ being the sample size, depending only on the smoothness $\beta$ of the activation, and crucially independent of token count, ambient dimension, or rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.
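As a rough illustration of the setting described in the abstract, the sketch below implements one plausible form of a single-layer attention-style pairwise-interaction model (tokens interacting through a weight matrix `A` and an activation `sigma`), together with the minimax rate $M^{-2\beta/(2\beta+1)}$. The exact parametrization in the paper may differ; the function names and the specific aggregation step here are illustrative assumptions, not the authors' definitions.

```python
import numpy as np

def pairwise_interaction_model(X, A, sigma=np.tanh):
    """Illustrative attention-style model (assumed form, not the paper's exact one):
    tokens interact pairwise through a weight matrix A and a nonlinearity sigma.
    X: (n_tokens, d) token features; A: (d, d) weight matrix."""
    scores = sigma(X @ A @ X.T)   # (n_tokens, n_tokens) pairwise interaction scores
    return scores @ X             # nonlocal aggregation over all tokens

def minimax_rate(M, beta):
    """The paper's rate M^{-2*beta/(2*beta+1)}: it depends only on the
    activation smoothness beta, not on n_tokens, d, or rank(A)."""
    return M ** (-2.0 * beta / (2.0 * beta + 1.0))
```

Note how the rate shrinks with smoother activations: for `beta = 1` it is `M**(-2/3)`, and as `beta` grows it approaches the parametric rate `M**(-1)`, with no dependence on the ambient dimension.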
Problem

Research questions and friction points this paper is trying to address.

Dimension-free minimax rates for attention models
Learning pairwise interactions with non-linear activations
Statistical efficiency independent of dimension and rank
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimax rates independent of token count
Convergence depends only on activation smoothness
Dimension-free statistical efficiency for attention models
Shai Zucker
Department of Applied Mathematics, Tel Aviv University
Xiong Wang
School of Mathematics, Sun Yat-sen University, Guangzhou, China
Fei Lu
Johns Hopkins University
applied probability · statistical learning · inverse problems · data assimilation
Inbar Seroussi
Tel-Aviv University, Israel
Probability theory · Statistical physics · Machine learning