Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,\lambda}$ Targets

📅 2026-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the approximation capability and statistical optimality of standard Transformers in nonparametric regression for Hölder continuous functions. Combining tools from function approximation theory and Hölder space analysis with a mathematical model of the Transformer, the study shows, to the authors' knowledge for the first time, that standard Transformers can approximate any function in $C^{s,\lambda}([0,1]^{d\times n})$ to arbitrary accuracy in the $L^t$ distance ($t \in [1,\infty]$) and achieve the minimax-optimal convergence rate in regression. Two descriptors, a size tuple and a dimension vector, give a fine-grained characterization of the model architecture. As intermediate results, the paper derives upper bounds on the Lipschitz constant and the memorization capacity of standard Transformers. The authors expect this characterization to support future work on the generalization and optimization errors of Transformers with different structures.

📝 Abstract

The tremendous success of Transformer models in fields such as large language models and computer vision calls for a rigorous theoretical investigation. To the best of our knowledge, this paper is the first to prove that standard Transformers can approximate Hölder functions in $C^{s,\lambda}\left([0,1]^{d\times n}\right)$ ($s\in\mathbb{N}_{\geq 0}$, $0<\lambda\leq 1$) under the $L^t$ distance ($t \in [1, \infty]$) with arbitrary precision. Building on this approximation result, we show that standard Transformers achieve the minimax optimal rate in nonparametric regression for Hölder target functions. By introducing two metrics, the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer structures, which facilitates future research on the generalization and optimization errors of Transformers with different structures. As intermediate results, we also derive upper bounds for the Lipschitz constant of standard Transformers and for their memorization capacity, which may be of independent interest. These findings provide theoretical justification for the powerful capabilities of Transformer models.
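
For orientation, the two objects named in the abstract can be written out in their textbook form. This is a sketch under stated assumptions, not the paper's own statement: the norm below is one common convention for $C^{s,\lambda}$, the symbol $N$ for the sample size and $D = dn$ for the input dimension are introduced here for illustration, and the paper's exact rate may differ by constants or logarithmic factors.

$$
\|f\|_{C^{s,\lambda}} \;=\; \max_{|\alpha|\le s}\,\sup_{x}\,\bigl|\partial^{\alpha} f(x)\bigr| \;+\; \max_{|\alpha| = s}\,\sup_{x\neq y}\,\frac{\bigl|\partial^{\alpha} f(x)-\partial^{\alpha} f(y)\bigr|}{\|x-y\|^{\lambda}}, \qquad x,y\in[0,1]^{d\times n}.
$$

With smoothness $\beta = s+\lambda$ and $N$ i.i.d. samples, the classical minimax rate for squared $L^2$ risk over a ball of this class is

$$
\inf_{\hat f}\;\sup_{\|f_0\|_{C^{s,\lambda}}\le B}\;\mathbb{E}\,\bigl\|\hat f - f_0\bigr\|_{L^2}^{2} \;\asymp\; N^{-\frac{2\beta}{2\beta + D}}, \qquad D = dn.
$$

Under these assumptions, a smoother target (larger $\beta$) gives a faster rate, while a larger input dimension $D$ slows it down. "Achieving the minimax rate" then means that the Transformer-based estimator's risk matches this bound, possibly up to logarithmic factors.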

Problem

Research questions and friction points this paper is trying to address.

Transformers

nonparametric regression

minimax rate

Hölder functions

approximation theory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer

nonparametric regression

minimax rate