🤖 AI Summary
This work investigates the approximation and generalization capabilities of Transformers for regression tasks over compact Euclidean domains and Riemannian manifolds. The authors propose a constructive framework in which attention mechanisms induce spatial localization, enabling local function approximation via affine transformations and softmax activations, which are then aggregated into a global approximant. They establish, for the first time, that a shallow, wide Transformer with only two encoder blocks can uniformly approximate any Hölder continuous function up to an error ε, using 𝒪(ε⁻ᵈ/ᵅ) parameters. Furthermore, they derive a near-minimax optimal generalization error bound of 𝒪(n⁻²ᵅ/(²ᵅ⁺ᵈ) log n), thereby demonstrating the theoretical superiority of their approach.
📝 Abstract
This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $α$-Hölder continuous functions with $α\in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/α})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2α}{2α+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.