Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep autoregressive Transformers can model sequential order without explicit positional encodings (PEs), a fact known since early Transformer language-modeling work contemporary with GPT-2, but never clearly explained in print. This paper reviews that long-forgotten explanation: autoregressive Transformers with two or more layers can distinguish permutations of their input sequences, with the implicit positional awareness emerging in the deeper layers, whereas one-layer models provably require PEs to discern token order. The paper also traces the origin of this result and aims to re-establish it as common knowledge, countering the misconception that explicit PEs are strictly necessary for multi-layer models.

📝 Abstract
Do autoregressive Transformer language models require explicit positional encodings (PEs)? The answer is "no" as long as they have more than one layer -- they can distinguish sequences with permuted tokens without requiring explicit PEs. This property has been known since early efforts (those contemporary with GPT-2) adopting the Transformer for language modeling. However, this result does not appear to have been well disseminated and was even rediscovered recently. This may be partially due to a sudden growth of the language modeling community after the advent of GPT-2, but perhaps also due to the lack of a clear explanation in prior publications, despite being commonly understood by practitioners in the past. Here we review this long-forgotten explanation of why explicit PEs are nonessential for multi-layer autoregressive Transformers (in contrast, one-layer models require PEs to discern order information of their input tokens). We also review the origin of this result, and hope to re-establish it as common knowledge.
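The abstract's core claim, that a one-layer causal-attention model without PEs is permutation-invariant while a two-layer one is not, can be checked with a toy experiment. The sketch below is not from the paper; it assumes a single attention head with no projections, feed-forward blocks, or residual connections, which is enough to expose the effect. With one layer, the output at the last position is a weighted sum over the *multiset* of prefix tokens, so reordering the prefix changes nothing; with two layers, intermediate positions see different prefixes under the causal mask, so the second layer's output becomes order-sensitive.

```python
import numpy as np

def causal_attn(X):
    # One causal self-attention layer with no learned projections and no
    # positional encodings: token t attends only to tokens 0..t.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ X

rng = np.random.default_rng(0)
vocab = rng.normal(size=(10, 8))       # 10 random token embeddings, dim 8
seq  = vocab[[3, 1, 4, 1, 5]]          # token ids: 3 1 4 1 5
perm = vocab[[1, 4, 3, 1, 5]]          # same tokens, prefix reordered

# One layer: the last position's output depends only on which tokens
# appear in the prefix, not on their order -> identical outputs.
print(np.allclose(causal_attn(seq)[-1], causal_attn(perm)[-1]))   # True

# Two layers: positions 0..3 now carry prefix-length-dependent states,
# so the second layer's last-position output distinguishes the orders.
h1 = causal_attn(causal_attn(seq))
h2 = causal_attn(causal_attn(perm))
print(np.allclose(h1[-1], h2[-1]))
```

The invariance in the one-layer case is exact up to floating-point summation order, which is why `np.allclose` rather than equality is used; the two-layer outputs differ by a macroscopic amount for generic random embeddings.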
Problem

Research questions and friction points this paper is trying to address.

Positional Encoding
Transformer Models
Sequence Discrimination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-layer Autoregressive Transformer
Positional Encoding Independence
Reversed Word Order Processing