🤖 AI Summary
This study investigates the origins of grokking in Transformers, focusing on how inductive biases drive the transition from memorization to generalization. Through systematic ablation experiments, the authors analyze how architectural and optimization factors (such as the placement of Layer Normalization, the learning rate, and weight decay) affect the timing and mechanism of grokking. They find that different normalization pathways modulate shortcut learning and attention entropy, while the readout scale, a proposed control variable for lazy training, is highly sensitive to hyperparameter choices. Moreover, feature compressibility evolves continuously throughout training and reliably predicts when generalization emerges. These findings reveal how inductive biases shape internal feature structure to guide generalization in deep sequence models.
📝 Abstract
We investigate grokking in transformers through the lens of inductive bias: dispositions arising from architecture or optimization that lead the network to prefer one solution over another. We first show that architectural choices, such as the position of Layer Normalization (LN), strongly modulate grokking speed. This modulation is explained by isolating how LN on specific pathways shapes shortcut learning and attention entropy. We then study how different optimization settings modulate grokking, which induces distinct interpretations of previously proposed controls such as readout scale. In particular, we find that using readout scale as a control for lazy training can be confounded by learning rate and weight decay in our setting. Consistent with this, we show that features evolve continuously throughout training, suggesting that grokking in transformers can be more nuanced than a lazy-to-rich transition of the learning regime. Finally, we show how generalization predictably emerges with feature compressibility in grokking, across different modulators of inductive bias. Our code is released at https://tinyurl.com/y52u3cad.
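The abstract refers to attention entropy as a diagnostic for shortcut learning. As context, a minimal sketch of how such a quantity is typically computed: the Shannon entropy of each attention distribution over key positions, where uniform attention gives maximal entropy and one-hot (shortcut-like) attention gives entropy near zero. The function below is an illustrative assumption, not the paper's actual implementation.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    weights: attention probabilities over key positions, summing to 1.
    Illustrative sketch; the paper's exact definition may differ.
    """
    return -sum(p * math.log(p) for p in weights if p > 0)

# Uniform attention over 4 positions: entropy = log(4) ≈ 1.386
print(attention_entropy([0.25] * 4))

# Sharply peaked (shortcut-like) attention: entropy near 0
print(attention_entropy([0.97, 0.01, 0.01, 0.01]))
```

In practice this would be averaged over heads, layers, and query positions to track how sharply the model attends as training progresses.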