AI Summary
This study investigates the transition from memorization to generalization in neural networks trained on modular arithmetic tasks, focusing on the global structural dynamics underlying the grokking phenomenon. Integrating causal analysis, spectral methods, algorithmic complexity measures, and singular learning theory, the work demonstrates that generalization arises from the model's spontaneous collapse onto a low-dimensional manifold, in which redundant parameters are shed under an implicit bias toward simplicity, accompanied by deep information compression. By offering the first explanation of grokking through the lens of global structural evolution, rather than local circuit mechanisms or optimization dynamics alone, the research establishes a new theoretical framework for understanding the nature of overfitting and generalization in deep learning.
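For concreteness, the task described above can be reproduced with a minimal sketch: a small MLP trained with strong weight decay on modular addition, where held-out accuracy jumps long after training accuracy saturates. The modulus, architecture, and hyperparameters below are illustrative assumptions, not the study's exact configuration.

```python
# Minimal sketch of the standard grokking setup: learn (a + b) mod p with a
# small MLP and strong weight decay. All hyperparameters are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97  # modulus of the arithmetic task

# All p^2 input pairs and their labels; hold out half as a test set.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2 :]

def one_hot(batch):
    # Concatenate one-hot encodings of the two operands -> (N, 2p).
    return torch.cat([nn.functional.one_hot(batch[:, 0], p),
                      nn.functional.one_hot(batch[:, 1], p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):  # grokking typically needs many full-batch steps
    opt.zero_grad()
    loss = loss_fn(model(one_hot(pairs[train_idx])), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            acc = (model(one_hot(pairs[test_idx])).argmax(dim=1)
                   == labels[test_idx]).float().mean().item()
        print(f"step {step:6d}  train_loss {loss.item():.4f}  test_acc {acc:.3f}")
```

With a setup like this, training accuracy typically saturates early while test accuracy stays near chance for thousands of steps before abruptly rising, which is the delayed-generalization signature the study analyzes.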
Abstract
Grokking on modular arithmetic has become the quintessential "fruit fly" experiment for investigating the mechanistic origins of model generalization. Existing research, however, remains narrowly focused on specific local circuits or optimization tuning, largely overlooking the global structural evolution that fundamentally drives the phenomenon. We propose that grokking originates from a spontaneous simplification of the model's internal structure, governed by the principle of parsimony. Integrating causal, spectral, and algorithmic complexity measures with Singular Learning Theory, we show that the transition from memorization to generalization corresponds to the physical collapse of redundant manifolds and to deep information compression, offering a novel perspective on the mechanisms of model overfitting and generalization.
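As one hedged illustration of the "spectral measures" invoked above, the effective rank of a weight matrix (the exponential of the entropy of its normalized singular-value spectrum, after Roy & Vetterli, 2007) is a standard proxy for collapse onto a low-dimensional manifold; the abstract does not specify the paper's exact metrics, so this is an assumed stand-in.

```python
# Effective rank (Roy & Vetterli, 2007): exp of the entropy of the normalized
# singular-value spectrum. Low values indicate a near-low-rank weight matrix,
# i.e. collapse onto a low-dimensional manifold. An assumed proxy, not
# necessarily the paper's exact metric.
import torch

def effective_rank(weight: torch.Tensor, eps: float = 1e-12) -> float:
    sigma = torch.linalg.svdvals(weight)   # singular values, descending
    probs = sigma / (sigma.sum() + eps)    # treat the spectrum as a distribution
    entropy = -(probs * (probs + eps).log()).sum()
    return entropy.exp().item()

# A random matrix has effective rank comparable to its full rank of 194 ...
w = torch.randn(256, 194)
print(effective_rank(w))
# ... while a rank-1 matrix plus small noise scores far lower.
u = torch.randn(256, 1) @ torch.randn(1, 194)
print(effective_rank(u + 0.01 * w))
```

Logging `effective_rank(model[0].weight)` inside the training loop sketched earlier and watching it drop around the grokking transition would be one concrete signature of the "collapse of redundant manifolds" the abstract describes.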