🤖 AI Summary
This work formalizes the notion of "algorithmic grokking," distinguishing genuine algorithmic understanding by neural networks from mere statistical interpolation, and investigates the generalization of Transformers across problem sizes under computational-complexity constraints. By analyzing infinite-width Transformers in both the lazy and rich training regimes, and by combining computational complexity theory with the Efficient Polynomial Time Heuristic Scheme (EPTHS) framework, the study establishes, for the first time, an upper bound on the inference-time computational complexity of the functions such models can learn. The findings show that despite their universal expressivity, Transformers have an inherent inductive bias toward low-complexity algorithms (such as search, copy, and sort) and fail to generalize to higher-complexity tasks, elucidating fundamental limitations and preferences in their algorithmic learning behavior.
📝 Abstract
We formally define Algorithmic Capture (i.e., "grokking" an algorithm) as the ability of a neural network to generalize to arbitrary problem sizes ($T$) with controllable error and minimal sample adaptation, distinguishing true algorithmic learning from statistical interpolation. By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity of the functions these networks can learn. We show that despite their universal expressivity, transformers possess an inductive bias towards low-complexity algorithms within the Efficient Polynomial Time Heuristic Scheme (EPTHS) class. This bias effectively prevents them from capturing higher-complexity algorithms, while allowing success on simpler tasks like search, copy, and sort.
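To make the Algorithmic Capture criterion concrete, here is a minimal hypothetical sketch of the kind of length-generalization probe it implies: a model that has truly grokked an algorithm (here, the copy task) should keep its error below a fixed tolerance as the problem size $T$ grows well beyond any training range. All function names and parameter values are illustrative assumptions, not definitions from the paper.

```python
import random

def copy_task(seq):
    """Ground-truth 'copy' algorithm: the output equals the input."""
    return list(seq)

def error_rate(model, T, n_trials=100):
    """Fraction of random length-T inputs the model fails to copy exactly."""
    failures = 0
    for _ in range(n_trials):
        seq = [random.randint(0, 9) for _ in range(T)]
        if model(seq) != copy_task(seq):
            failures += 1
    return failures / n_trials

def captures_algorithm(model, sizes=(8, 64, 512), eps=0.05):
    """Declare 'capture' only if error stays below eps at every probed size T.

    A statistical interpolator tuned to short inputs will typically pass at
    small T but fail at the larger, out-of-range sizes.
    """
    return all(error_rate(model, T) <= eps for T in sizes)
```

For example, `captures_algorithm(lambda s: list(s))` succeeds at every probed size, while a truncating shortcut like `lambda s: s[:4]` fails as soon as $T$ exceeds its memorized window.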