🤖 AI Summary
Compact positional encodings in Transformers induce two fundamental phenomena—*isolation*, wherein models fail to jointly learn adjacent simple sequence patterns, and *continuity*, wherein learned sequences generate attractor basins that erroneously pull nearby sequences into incorrect fixed points. Method: We provide the first rigorous mathematical proof that any compact positional encoding necessarily gives rise to both phenomena, and establish their causal link to degraded generalization. Using attractor basin modeling, theoretical analysis, and controlled synthetic sequence experiments, we empirically validate these predictions across multiple Transformer architectures. Contribution/Results: Our findings demonstrate that sequence learning capacity is fundamentally constrained by the topological properties of positional encodings—not by model capacity or training strategy—thereby revealing a foundational theoretical limitation for Transformer architecture design. This work unifies isolation and continuity as inherent consequences of compactness, offering principled guidance for developing more expressive positional representations.
📝 Abstract
Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these models. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both phenomena hinder Transformers from learning even simple sequence patterns. Isolation means that any learnable sequence must be isolated from every other learnable sequence; hence, some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling into that basin collapses towards the learned sequence. We mathematically prove that these phenomena emerge in every Transformer that uses a compact positional encoding, and we design rigorous experiments demonstrating that these theoretical limitations occur at practical scale.
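The attractor-basin behavior described above can be illustrated with a minimal toy sketch, assuming a contraction map as a stand-in for the model's update dynamics (this is not the paper's construction; `s_star` and `step` are hypothetical): a learned sequence acts as a fixed point, and any nearby sequence inside the basin is pulled onto it.

```python
import numpy as np

# Hypothetical "learned" sequence: the fixed point of the dynamics.
s_star = np.array([1.0, 2.0, 3.0])

def step(s, rate=0.5):
    # Toy stand-in for the model's update: contracts any input
    # toward the learned sequence s_star at a fixed rate.
    return s + rate * (s_star - s)

# A perturbed sequence that starts inside the attractor basin.
s = np.array([1.4, 1.7, 3.3])
for _ in range(20):
    s = step(s)

# After repeated updates, s has collapsed onto the learned sequence.
print(np.allclose(s, s_star))  # True
```

The collapse of `s` onto `s_star` mirrors the continuity phenomenon: the distinct input sequence is not preserved but is absorbed by the learned fixed point.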