Learning positional encodings in transformers depends on initialization

📅 2024-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Positional encodings (PEs) struggle to capture spatial structure in complex permutation-invariant data, e.g., 3D neuroscience recordings and nonlinear network simulations, where the underlying geometry is non-trivial. Method: We systematically investigate how learnable PEs generalize on non-trivial geometries and identify small-norm random initialization as a critical design principle. The approach combines learnable PEs with attention mechanisms and multidimensional relational reasoning, and uses PE visualization together with a geometric alignment evaluation for interpretability analysis. Contribution/Results: We provide theoretical and empirical evidence that initialization strongly governs PE learnability, interpretability, and downstream generalization; a poorly chosen initialization degrades both interpretability and performance. On 2D relational reasoning, nonlinear dynamical-system simulation, and real-world 3D neural data, the method recovers ground-truth spatial structure and yields significant gains in generalization performance.
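The summary's central design principle, initializing a learnable PE from a small-norm random distribution before adding it to token embeddings, can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the function names and the `scale` value of 0.02 are assumptions chosen for the sketch.

```python
import numpy as np

def init_learnable_pe(num_tokens, d_model, scale=0.02, seed=0):
    """Small-norm random initialization for a learnable positional encoding.

    `scale` is a hypothetical hyperparameter: the key idea is that a small
    initialization norm aids PE learnability; 0.02 is illustrative, not the
    authors' exact value.
    """
    rng = np.random.default_rng(seed)
    return scale * rng.standard_normal((num_tokens, d_model))

def add_pe(token_embeddings, pe):
    """Inject position information by adding the PE to each token embedding."""
    return token_embeddings + pe

# Compare a small-norm init against a unit-scale init of the same shape.
pe_small = init_learnable_pe(16, 32, scale=0.02)
pe_large = init_learnable_pe(16, 32, scale=1.0)
x = np.zeros((16, 32))          # placeholder token embeddings
y = add_pe(x, pe_small)
```

In a real transformer both the PE table and the token embeddings would be trainable parameters updated by gradient descent; only the initialization differs between the compared settings.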

📝 Abstract
The attention mechanism is central to the transformer's ability to capture complex dependencies between tokens of an input sequence. Key to the successful application of the attention mechanism in transformers is its choice of positional encoding (PE). The PE provides essential information that distinguishes the position and order amongst tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those presented in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real-world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known, such as in biological data. Here we study the importance of learning accurate PE for problems which rely on a non-trivial arrangement of input tokens. Critically, we find that the choice of initialization of a learnable PE greatly influences its ability to learn accurate PEs that lead to enhanced generalization. We empirically demonstrate our findings in three experiments: 1) A 2D relational reasoning task; 2) A nonlinear stochastic network simulation; 3) A real-world 3D neuroscience dataset, applying interpretability analyses to verify the learning of accurate PEs. Overall, we find that a learned PE initialized from a small-norm distribution can 1) uncover interpretable PEs that mirror ground truth positions in multiple dimensions, and 2) lead to improved downstream generalization in empirical evaluations. Importantly, choosing an ill-suited PE can be detrimental to both model interpretability and generalization. Together, our results illustrate the feasibility of learning identifiable and interpretable PEs for enhanced generalization.
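One way to make the abstract's interpretability claim concrete is to test whether ground-truth token positions can be linearly read out of a learned PE. The sketch below is a hypothetical stand-in for the paper's alignment analysis: the function name, the least-squares readout, and the R² score are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def pe_alignment_r2(pe, coords):
    """R^2 of a least-squares readout of ground-truth coords from a PE table.

    Values near 1 mean the learned PE linearly encodes the true positions;
    a hypothetical proxy for the paper's geometric alignment evaluation.
    """
    X = np.hstack([pe, np.ones((pe.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, coords, rcond=None)
    resid = coords - X @ W
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = ((coords - coords.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

# Toy example: tokens arranged on a 4x4 grid (16 positions, 2D ground truth).
rng = np.random.default_rng(0)
coords = np.array([[i, j] for i in range(4) for j in range(4)], dtype=float)

# A "well-learned" 3-dim PE: a linear embedding of the grid plus small noise.
pe_good = coords @ rng.standard_normal((2, 3)) + 0.01 * rng.standard_normal((16, 3))
# An uninformative PE of the same shape.
pe_rand = rng.standard_normal((16, 3))

r2_good = pe_alignment_r2(pe_good, coords)
r2_rand = pe_alignment_r2(pe_rand, coords)
```

Here `r2_good` should be close to 1 while `r2_rand` is markedly lower, mirroring the finding that an accurate PE recovers ground-truth positions in multiple dimensions.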
Problem

Research questions and friction points this paper is trying to address.

Positional Encoding
Transformer Models
Multidimensional Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Positional Encoding
Transformer Models
Learning Range Optimization