🤖 AI Summary
This work investigates fundamental differences between grokking (characterized by sudden, delayed generalization) and standard training, focusing on learning dynamics and representational properties. Methodologically, we employ feature analysis, compressibility assessment, information-geometric modeling, and tracking of training trajectories. We find that both paradigms learn the same task-relevant features but differ markedly in encoding efficiency: standard training exhibits a novel "compressive regime" with a linear trade-off between loss and compressibility, whereas grokking evolves along a straight path in information space, a property we formalize with novel information-geometric measures. Experiments show that standard-trained models achieve compression factors of up to 25× the base model, five times greater than grokked models, and that grokking attains peak compressibility immediately after the grokking plateau. Our core contributions are threefold: (i) identification of the compressive regime and its loss-compressibility trade-off in standard training; (ii) an information-geometric characterization of grokking as a straight path in information space; and (iii) empirical evidence that model development in grokking is task-dependent, with peak compressibility reached immediately post-generalization.
📝 Abstract
Grokking typically achieves similar loss to ordinary, "steady", learning. We ask whether these different learning paths, grokking versus ordinary training, lead to fundamental differences in the learned models. To do so, we compare the features, compressibility, and learning dynamics of models trained via each path on two tasks. We find that grokked and steadily trained models learn the same features, but there can be large differences in the efficiency with which these features are encoded. In particular, we find a novel "compressive regime" of steady training, in which a linear trade-off emerges between model loss and compressibility, and which is absent in grokking. In this regime, we can achieve compression factors of 25× the base model, and 5× the compression achieved in grokking. We then track how model features and compressibility develop through training. We show that model development in grokking is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau. Finally, we introduce novel information-geometric measures which demonstrate that models undergoing grokking follow a straight path in information space.