🤖 AI Summary
This paper investigates the statistical convergence rate of single-layer attention-style models for learning pairwise interactions. Addressing limitations in existing theory, where rates depend on token count, input dimensionality, or the rank of the weight matrix, we propose a dimension-free minimax analysis framework: under only a $\beta$-Hölder continuity assumption on the activation function, we establish the optimal convergence rate $M^{-2\beta/(2\beta+1)}$, where $M$ is the sample size. Methodologically, we integrate nonparametric regression theory with attention's nonlocal interaction modeling, providing the first rigorous characterization, under nonlinear activations, of the statistical efficiency jointly induced by weight matrices and feature interactions. Our key contribution is breaking the curse of dimensionality, revealing an intrinsic statistical advantage of attention models in learning higher-order interactions; this yields the first tight theoretical benchmark for understanding their empirical efficacy.
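For concreteness, here is a minimal sketch of the kind of model under study: tokens interacting pairwise through a weight matrix and a nonlinear activation. The exact functional form below (bilinear scores $x_i^\top W x_j$ passed through an activation and averaged) is our illustrative assumption, not the paper's precise definition.

```python
import numpy as np

def pairwise_attention_score(X, W, activation=np.tanh):
    """Sketch of a single-layer attention-style pairwise interaction model.

    X : (n_tokens, d) token embeddings
    W : (d, d) interaction weight matrix (possibly low-rank)
    The bilinear pair scores x_i^T W x_j are passed through a nonlinear
    activation and averaged; this form is an assumed illustration.
    """
    scores = X @ W @ X.T               # (n_tokens, n_tokens) pairwise scores
    return activation(scores).mean()   # aggregate activated interactions

# Example: 5 tokens in ambient dimension 8, with a rank-2 weight matrix;
# per the result, the rate depends on none of these three quantities.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
U, V = rng.standard_normal((8, 2)), rng.standard_normal((2, 8))
W = U @ V
print(pairwise_attention_score(X, W))
```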
📝 Abstract
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size; the rate depends only on the smoothness $\beta$ of the activation and is crucially independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and the activation are not separately identifiable, and they provide a theoretical understanding of the attention mechanism and its training.
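To make the dimension-free claim concrete, compare against the classical minimax rate for $\beta$-Hölder regression with $d$-dimensional inputs (Stone, 1982), which degrades with $d$; only the second display below is what the abstract asserts for attention-style models.

```latex
\[
  \underbrace{M^{-\frac{2\beta}{2\beta+d}}}_{\text{generic nonparametric regression in } d \text{ dimensions}}
  \qquad\text{vs.}\qquad
  \underbrace{M^{-\frac{2\beta}{2\beta+1}}}_{\text{attention-style pairwise model}}
\]
```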