🤖 AI Summary
This paper investigates the sequence modeling efficiency of single-layer, single-head Transformers, aiming to characterize their expressive power for nonlinear sequence relationships from an approximation-theoretic perspective.
Method: We develop the first explicit approximation-rate framework tailored to Transformers, introducing a novel complexity measure grounded in the frequency-domain properties and local correlations of the target sequence relationship. Leveraging tools from harmonic analysis and function-space theory, we derive explicit Jackson-type upper bounds on the approximation rate.
Contribution/Results: Our analysis shows that, for sequences dominated by low-frequency components and exhibiting strong local correlations, this Transformer architecture achieves an exponential approximation advantage over RNNs. The work provides the first rigorous approximation-theoretic quantification of the Transformer's structural advantages, identifies the classes of sequential relationships the architecture is best suited to approximate, and offers principled theoretical guidance for designing lightweight Transformer variants.
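The Jackson-type estimates referenced above follow a classical template from approximation theory, sketched below for orientation. The first display is the standard Jackson inequality for trigonometric approximation; the second only conveys the general shape of a Transformer bound of this kind, with the symbols ($\mathcal{T}_m$, $C(H)$, $\alpha$) chosen here for illustration rather than taken from the paper.

```latex
% Classical Jackson inequality (illustrative template, not the paper's theorem):
% the error of the best degree-n trigonometric approximation to a continuous,
% 2\pi-periodic function f is controlled by its modulus of continuity \omega.
\[
  E_n(f) \;=\; \inf_{\deg T \le n} \| f - T \|_{\infty}
  \;\le\; C \, \omega\!\left(f, \tfrac{1}{n}\right).
\]
% A Transformer bound of the same flavor would read, schematically,
\[
  \inf_{\hat{H} \in \mathcal{T}_m} \| H - \hat{H} \|
  \;\lesssim\; \frac{C(H)}{m^{\alpha}},
\]
% where \mathcal{T}_m is the class of single-layer, single-head Transformers
% with parameter budget m, C(H) is a complexity measure of the target
% relationship H (here, encoding its frequency and local-correlation
% structure), and \alpha > 0 reflects its regularity. The exact complexity
% measure, norm, and exponent are as defined in the paper.
```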
📝 Abstract
The Transformer architecture is widely used in sequence modeling applications, yet theoretical understanding of its working principles remains limited. In this work, we investigate the approximation rate of single-layer Transformers with one attention head. We consider a class of non-linear sequential relationships and identify a novel notion of complexity measure with which we establish an explicit Jackson-type approximation rate estimate for the Transformer. This rate reveals the structural properties of the Transformer and suggests the types of sequential relationships it is best suited to approximate. In particular, the approximation rate results enable us to concretely analyze the differences between the Transformer and classical sequence modeling methods, such as recurrent neural networks.
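As a concrete reference point for the architecture class studied here, a minimal NumPy sketch of a single-layer, single-head self-attention map is given below. The dimensions, the omission of positional encoding, feed-forward sublayer, and normalization, and the function name `single_head_transformer` are illustrative assumptions; the paper's exact parameterization may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_transformer(X, W_q, W_k, W_v, W_o):
    """One self-attention layer with a single head (illustrative sketch only).

    X             : (T, d)   input sequence of T tokens of dimension d
    W_q, W_k, W_v : (d, d_k) query/key/value projections
    W_o           : (d_k, d) output projection
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) attention logits
    A = softmax(scores, axis=-1)              # row-stochastic attention weights
    return A @ V @ W_o                        # (T, d) output sequence

# Example: a random 8-token sequence in dimension 4.
rng = np.random.default_rng(0)
T, d, d_k = 8, 4, 4
params = [rng.standard_normal(s) * 0.1 for s in [(d, d_k)] * 3 + [(d_k, d)]]
X = rng.standard_normal((T, d))
Y = single_head_transformer(X, *params)
print(Y.shape)  # (8, 4)
```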