Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies the approximation and generalization capabilities of Transformers for regression on the compact Euclidean domain $[0,1]^d$ and on $d$-dimensional compact Riemannian manifolds. The authors propose a constructive framework in which the attention mechanism localizes in space through affine transformations of the input, local approximations of the target function are built on each region, and the softmax activation aggregates them into a global approximant as a partition of unity. They prove that a dense, shallow, wide Transformer with only two encoder blocks can uniformly approximate any $\alpha$-Hölder continuous function, $\alpha \in (0,1]$, to within error $\varepsilon$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ parameters. They further derive a near-minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training sample size.
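The link between the parameter count and the generalization rate can be seen from a heuristic bias-variance balance. The sketch below is not the paper's proof: it assumes an estimation term of the form $N \log n / n$ for a network with $N$ parameters (a typical shape for covering-number arguments, taken here as an assumption) and ignores constants.

```latex
% Heuristic balance (assumption: estimation error scales like N log(n) / n).
% Approximation: N \asymp \varepsilon^{-d/\alpha} parameters give uniform error \varepsilon.
\[
  \underbrace{\varepsilon^{2}}_{\text{squared approximation error}}
  \;\asymp\;
  \underbrace{\frac{N \log n}{n}}_{\text{estimation error}},
  \qquad N \asymp \varepsilon^{-d/\alpha}.
\]
% Ignoring the logarithm when solving for \varepsilon:
\[
  \varepsilon^{\,2 + d/\alpha} = \varepsilon^{\frac{2\alpha + d}{\alpha}} \asymp \frac{1}{n}
  \;\Longrightarrow\;
  \varepsilon \asymp n^{-\frac{\alpha}{2\alpha + d}},
  \qquad
  \varepsilon^{2} \asymp n^{-\frac{2\alpha}{2\alpha + d}}.
\]
```

Up to the logarithmic factor, this matches the order $n^{-\frac{2\alpha}{2\alpha+d}}$ stated for the empirical risk minimizer.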

📝 Abstract

This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-Hölder continuous functions with $\alpha \in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.
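The local-to-global idea can be illustrated with a small numerical sketch in one dimension. This is not the paper's Transformer construction: it only shows how softmax weights over scores that are affine in the input form a partition of unity, and how local (here, constant) approximations are blended into a global one. The function names, the grid of centers, the sharpness `beta`, and the test function are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax_weights(x, centers, beta):
    """Softmax partition of unity on [0, 1] built from affine scores.

    The score beta * (2*c_k*x - c_k**2) is affine in x. It equals
    -beta * (x - c_k)**2 plus the term beta * x**2, which is shared by
    every k and therefore cancels inside the softmax.
    """
    scores = beta * (2.0 * np.outer(x, centers) - centers**2)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)  # each row sums to 1


def local_to_global(f, x, num_cells):
    """Blend local constant approximations f(c_k) with softmax weights."""
    centers = (np.arange(num_cells) + 0.5) / num_cells  # cell midpoints
    beta = 4.0 * num_cells**2  # hypothetical sharpness, grows with resolution
    w = softmax_weights(x, centers, beta)
    return w @ f(centers), w


def holder_target(t):
    """An alpha-Hölder function with alpha = 1/2 (illustrative choice)."""
    return np.sqrt(np.abs(t - 0.5))


if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 2001)
    for k in (8, 64):
        approx, _ = local_to_global(holder_target, x, k)
        err = np.max(np.abs(approx - holder_target(x)))
        print(f"cells={k:3d}  uniform error={err:.4f}")
```

Refining the grid of centers shrinks the uniform error, which is the one-dimensional analogue of spending more parameters to reach a smaller $\varepsilon$.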

Problem

Research questions and friction points this paper is trying to address.

Transformers

learning theory

approximation

regression

generalization error

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer approximation theory

softmax partition of unity

local-to-global approximation