🤖 AI Summary
It remains unclear why deep autoregressive Transformers can model sequential order without explicit positional encodings (PEs).
Method: We combine theoretical analysis, attention weight attribution, controlled experiments comparing single- and multi-layer models, and historical literature review.
Results: We show that autoregressive Transformers with two or more layers can strictly distinguish arbitrary permutations of their input sequences; this implicit positional awareness arises from the interaction between the attention mechanism and the feed-forward networks across layers. In contrast, single-layer models provably require PEs. The work reviews and systematically explains the intrinsic sequential representation mechanism of multi-layer Transformers, correcting the common misconception that explicit PEs are strictly necessary, and aims to re-establish as common knowledge that deep autoregressive Transformers possess inherent implicit positional modeling capacity without relying on explicit positional encoding schemes.
📝 Abstract
Do autoregressive Transformer language models require explicit positional encodings (PEs)? The answer is "no" as long as they have more than one layer -- they can distinguish sequences with permuted tokens without explicit PEs. This property has been known since early efforts (those contemporary with GPT-2) to adopt the Transformer for language modeling. However, the result does not appear to have been well disseminated, and it was even rediscovered recently. This may be partly due to the sudden growth of the language modeling community after the advent of GPT-2, but perhaps also due to the lack of a clear explanation in prior publications, despite being commonly understood by practitioners at the time. Here we review this long-forgotten explanation of why explicit PEs are nonessential for multi-layer autoregressive Transformers (in contrast, one-layer models require PEs to discern the order of their input tokens). We also review the origin of this result, and hope to re-establish it as common knowledge.
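The core of the argument can be illustrated with a toy model that strips attention down to its simplest form. The sketch below (my own illustration, not code from the paper) replaces learned attention with uniform causal attention, so each position simply averages its prefix. At the last position, one such layer sees only the *bag* of tokens and is therefore permutation-invariant; but the intermediate outputs of the first layer are prefix means, which depend on position, so a second layer stacked on top weights earlier tokens more heavily than later ones and can tell permutations apart -- no explicit PEs involved.

```python
import numpy as np

def uniform_causal_attention(x):
    # Toy causal self-attention: every position attends uniformly to its
    # prefix (identity values, no learned weights, no positional encodings).
    # Output at position i is the mean of x[0..i].
    return np.array([x[: i + 1].mean(axis=0) for i in range(len(x))])

# One-hot "token embeddings" for three distinct tokens.
a, b, c = np.eye(3)
seq  = np.stack([a, b, c])  # original order
perm = np.stack([b, a, c])  # same multiset of tokens, different order

# Read out only the last position, as an autoregressive LM would when
# predicting the next token.
def one_layer(x):
    return uniform_causal_attention(x)[-1]

def two_layer(x):
    return uniform_causal_attention(uniform_causal_attention(x))[-1]

# One layer: the last position is just the mean of all tokens,
# so any permutation yields the identical output.
print(np.allclose(one_layer(seq), one_layer(perm)))   # True

# Two layers: the second layer averages the first layer's prefix means,
# which implicitly weights token i by (1/i + 1/(i+1) + ... + 1/n) / n --
# a position-dependent (harmonic) weighting that separates permutations.
print(np.allclose(two_layer(seq), two_layer(perm)))   # False
```

In this toy setting the position information emerges purely from the causal mask: the prefix length itself acts as an implicit position signal, which is why a second layer suffices while a single layer (read at the last position) remains order-blind.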