🤖 AI Summary
This paper investigates whether a single-layer Transformer possesses the universal approximation property (UAP) for vocabulary in-context learning (VICL). Theoretically, the authors prove that without positional encoding the model cannot achieve VICL-UAP, whereas positional encodings satisfying certain spectral conditions, such as the sinusoidal encoding, restore it. The analysis is grounded in approximation theory: VICL capability is formalized through a mathematical characterization of the model's representational capacity. The work thereby establishes a framework linking positional encoding to VICL-UAP, showing from an approximation-theoretic perspective that a suitable positional encoding is both *necessary and sufficient* for VICL-UAP: it is not merely a heuristic aid for sequence modeling, but a theoretical prerequisite for contextual generalization. This provides a new lens on the essential role of positional information in Transformers.
📝 Abstract
Numerous studies have demonstrated that the Transformer architecture possesses the capability for in-context learning (ICL). In scenarios involving function approximation, context can serve as a control parameter for the model, endowing it with the universal approximation property (UAP). In practice, context is represented by tokens from a finite set, referred to as a vocabulary, which is the case considered in this paper, *i.e.*, vocabulary in-context learning (VICL). We demonstrate that VICL in single-layer Transformers, without positional encoding, does not possess the UAP; however, it is possible to achieve the UAP when positional encoding is included. Several sufficient conditions for the positional encoding are provided. Our findings reveal the benefits of positional encoding from an approximation theory perspective in the context of ICL.
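For concreteness, the sinusoidal encoding discussed above is the standard construction from the original Transformer: each position receives a fixed vector of sines and cosines at geometrically spaced frequencies. The sketch below shows only this well-known construction; whether a given encoding meets the paper's spectral conditions for VICL-UAP is the paper's own result, not something the code verifies.

```python
import math

def sinusoidal_pe(seq_len: int, d_model: int, base: float = 10000.0):
    """Standard sinusoidal positional encoding.

    pe[pos][2i]   = sin(pos / base^(2i / d_model))
    pe[pos][2i+1] = cos(pos / base^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Each even/odd pair shares one geometrically spaced frequency.
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Distinct positions receive distinct encoding vectors, which is the
# property that lets the model break the permutation symmetry of attention.
encoding = sinusoidal_pe(seq_len=4, d_model=8)
```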