On the Geometry of Positional Encodings in Transformers

📅 2026-04-06
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Transformer models are inherently insensitive to word order and rely on positional encodings to inject sequential information, yet existing designs often lack a rigorous theoretical foundation. This work proposes a geometric framework for positional encoding, establishing its necessity and separability, and derives a minimally parameterized representation. Building on the Hellinger distance and classical multidimensional scaling (MDS), the authors construct an information-theoretically optimal encoding scheme. Leveraging matrix rank analysis and neural tangent kernel (NTK) theory, they unify the evaluation of encoding quality into a single stress metric. Empirical validation on SST-2 and IMDB shows that ALiBi encodings achieve significantly lower stress than sinusoidal and RoPE encodings, corroborating their near-rank-1 optimal structure.
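The summary above says encoding quality is collapsed into a single stress number. As a hedged sketch (the paper's exact normalisation is not reproduced here, so a Kruskal-type stress-1 is assumed), this is what comparing a candidate encoding against a target distance matrix might look like. The shift-equivariant target d(i, j) = |i − j| / (n − 1) is a hypothetical stand-in for the rank-1-style structure the summary associates with ALiBi; the sinusoidal encoding is the standard one from Vaswani et al.

```python
import numpy as np

def stress(E, D):
    """Kruskal-type stress-1 between the pairwise Euclidean distances of
    encoding rows E (one row per position) and a target distance matrix D.
    The exact normalisation used in the paper is an assumption here."""
    DE = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    return float(np.sqrt(((DE - D) ** 2).sum() / (D ** 2).sum()))

def sinusoidal(n, d):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d))
    E = np.zeros((n, d))
    E[:, 0::2] = np.sin(angles)   # even dimensions: sine
    E[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return E

# Hypothetical shift-equivariant target: d(i, j) = |i - j| / (n - 1).
n, d = 16, 8
D = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) / (n - 1)
print(stress(sinusoidal(n, d), D))
```

A one-dimensional encoding that places position i at i / (n − 1) reproduces this particular target exactly and therefore has stress zero, which illustrates why a near-rank-1 encoding can score well on such a target.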
📝 Abstract
Neural language models process sequences of words, but the mathematical operations inside them are insensitive to the order in which words appear. Positional encodings are the component added to remedy this. Despite their importance, positional encodings have been designed largely by trial and error, without a mathematical theory of what they ought to do. This paper develops such a theory. Four results are established. First, any Transformer without a positional signal cannot solve any task sensitive to word order (Necessity Theorem). Second, training assigns distinct vector representations to distinct sequence positions at every global minimiser, under mild and verifiable conditions (Positional Separation Theorem). Third, the best achievable approximation to an information-optimal encoding is constructed via classical multidimensional scaling (MDS) on the Hellinger distance between positional distributions; the quality of any encoding is measured by a single number, the stress (Proposition 5, Algorithm 1). Fourth, the optimal encoding has effective rank r = rank(B) <= n-1 and can be represented with r(n+d) parameters instead of nd (minimal parametrisation result). Appendix A develops a proof of the Monotonicity Conjecture within the Neural Tangent Kernel (NTK) regime for masked language modelling (MLM) losses, sequence classification losses, and general losses satisfying a positional sufficiency condition, through five lemmas. Experiments on SST-2 and IMDB with BERT-base confirm the theoretical predictions and reveal that Attention with Linear Biases (ALiBi) achieves much lower stress than the sinusoidal encoding and Rotary Position Embedding (RoPE), consistent with a rank-1 interpretation of the MDS encoding under approximate shift-equivariance.
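The abstract's construction (Proposition 5, Algorithm 1) can be sketched end to end: compute pairwise Hellinger distances between positional distributions, double-center the squared distances, eigendecompose the resulting Gram matrix B, and read off the embedding and the effective rank r = rank(B) ≤ n − 1. This is a minimal sketch assuming toy positional distributions; the paper's actual distributions are model-derived and not reproduced here.

```python
import numpy as np

def hellinger_matrix(P):
    """Pairwise Hellinger distances between rows of P, where each row is
    a probability distribution: H(p, q) = ||sqrt(p) - sqrt(q)||_2 / sqrt(2)."""
    S = np.sqrt(P)
    diff = S[:, None, :] - S[None, :, :]
    return np.linalg.norm(diff, axis=-1) / np.sqrt(2.0)

def classical_mds(D, d=None):
    """Classical MDS: embed n points so Euclidean distances match D.
    Returns the embedding X, the Gram matrix B, and r = rank(B) <= n - 1."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J            # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1]            # eigenvalues, descending
    w, V = w[order], V[:, order]
    r = int((w > 1e-10).sum())             # effective rank, at most n - 1
    if d is None:
        d = r
    X = V[:, :d] * np.sqrt(np.clip(w[:d], 0.0, None))
    return X, B, r

# Toy positional distributions (an assumption for illustration): position i
# concentrates probability mass around entry 2*i of a small vocabulary.
n, vocab = 8, 16
logits = -0.5 * (np.arange(vocab)[None, :] - 2.0 * np.arange(n)[:, None]) ** 2
P = np.exp(logits)
P /= P.sum(axis=1, keepdims=True)

D = hellinger_matrix(P)
X, B, r = classical_mds(D)
print(r)   # effective rank; bounded by n - 1 = 7
```

Because Hellinger distance is the Euclidean distance between the scaled square-root vectors, D here is exactly Euclidean-embeddable, so the full-rank MDS embedding reproduces D up to numerical error; the minimal-parametrisation result then stores the encoding with r(n + d) parameters rather than nd.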
Problem

Research questions and friction points this paper is trying to address.

positional encodings
Transformers
word order
mathematical theory
sequence modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

positional encoding
multidimensional scaling
Hellinger distance
effective rank
Transformer geometry