🤖 AI Summary
This study investigates the dynamical mechanisms underlying grokking -- the abrupt transition from memorization to generalization in neural networks. Through finite-size scaling analysis and a gradient avalanche model, it finds that grokking is fundamentally a dimensional phase transition driven by the geometry of the gradient field: the effective dimensionality \( D(t) \) jumps from a subdiffusive regime (\( D < 1 \)) to a superdiffusive one (\( D > 1 \)) precisely at the onset of generalization. The transition exhibits self-organized criticality and is robust across diverse network topologies. Although backpropagation-induced correlations in real training produce a dimensional excess relative to synthetic gradients, the critical behavior of \( D(t) \) crossing unity persists. The work thus frames grokking as a universal, architecture-agnostic dynamical phenomenon.
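A minimal numerical sketch of the claimed \( D(t) \) crossing may help fix ideas. The summary does not specify the estimator for \( D \); the code below assumes the standard diffusion-exponent definition, \( \mathrm{MSD}(\tau) \propto \tau^{D} \), and uses a toy trajectory whose increment correlations flip sign halfway through as a stand-in for training dynamics. The helper `msd_exponent`, the AR(1) trajectory, and all parameters are illustrative assumptions, not the authors' method.

```python
import numpy as np

def msd_exponent(increments, lags):
    """Slope of log MSD vs. log lag for the path obtained by summing increments."""
    path = np.cumsum(increments, axis=0)
    msd = [np.mean(np.sum((path[l:] - path[:-l]) ** 2, axis=1)) for l in lags]
    return np.polyfit(np.log(lags), np.log(msd), 1)[0]

rng = np.random.default_rng(0)
T, P = 40_000, 16
noise = rng.standard_normal((T, P))

# Toy trajectory: anticorrelated increments (sub-diffusive, D < 1) before the
# midpoint, positively correlated increments (super-diffusive, D > 1) after it.
rho_pre, rho_post = -0.6, 0.6
inc = np.empty_like(noise)
inc[0] = noise[0]
for t in range(1, T):
    rho = rho_pre if t < T // 2 else rho_post
    inc[t] = rho * inc[t - 1] + np.sqrt(1 - rho**2) * noise[t]

# Sliding-window exponent D(t); the crossing of 1 marks the regime switch.
window, lags = 4_000, np.arange(1, 64)
D_t = [msd_exponent(inc[s : s + window], lags)
       for s in range(0, T - window + 1, window)]
crossing = next(i for i, d in enumerate(D_t) if d > 1)
print([round(d, 2) for d in D_t], "-> crosses 1 at window", crossing)
```

The per-window exponent sits below 1 in the anticorrelated regime and above 1 in the persistent one, so the crossing of unity localizes the regime switch, mirroring how the paper localizes grokking.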
📝 Abstract
Neural network grokking -- the abrupt memorization-to-generalization transition -- challenges our understanding of learning dynamics. Through finite-size scaling of gradient avalanche dynamics across eight model scales, we find that grokking is a \textit{dimensional phase transition}: the effective dimensionality~$D$ crosses unity at generalization onset, separating a sub-diffusive (subcritical, $D < 1$) regime from a super-diffusive (supercritical, $D > 1$) one, and exhibiting self-organized criticality (SOC). Crucially, $D$ reflects \textbf{gradient field geometry}, not network architecture: synthetic i.i.d.\ Gaussian gradients maintain $D \approx 1$ regardless of graph topology, while real training exhibits a dimensional excess from backpropagation correlations. The $D(t)$ crossing, localized at grokking and robust across topologies, offers new insight into the trainability of overparameterized networks.
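To make the gradient-geometry contrast concrete, here is a small sketch (not from the paper) under the same assumed diffusion-exponent definition of $D$: temporally i.i.d.\ Gaussian gradients give $D \approx 1$ even after being mixed through a fixed random coupling matrix (a crude proxy for graph topology), while temporally correlated gradients (a crude proxy for backpropagation-induced correlations) push the measured exponent above 1 over the fitted lag window. All names and parameters are illustrative assumptions.

```python
import numpy as np

def msd_exponent(increments, lags):
    """Diffusion exponent: slope of log MSD vs. log lag."""
    path = np.cumsum(increments, axis=0)
    msd = [np.mean(np.sum((path[l:] - path[:-l]) ** 2, axis=1)) for l in lags]
    return np.polyfit(np.log(lags), np.log(msd), 1)[0]

rng = np.random.default_rng(1)
T, P = 20_000, 32
lags = np.unique(np.logspace(0, 2.5, 20).astype(int))

# Temporally i.i.d. Gaussian "gradients": ordinary diffusion, D ~ 1.
iid = rng.standard_normal((T, P))
print("i.i.d.           :", round(msd_exponent(iid, lags), 2))

# Mixing coordinates through a fixed random coupling (a stand-in for graph
# topology) leaves increments uncorrelated in time, so D stays ~ 1.
W = rng.standard_normal((P, P)) / np.sqrt(P)
print("i.i.d. + coupling:", round(msd_exponent(iid @ W, lags), 2))

# Temporal AR(1) correlation (a crude proxy for backprop-induced
# correlations) inflates the measured exponent above 1 over these lags.
rho = 0.8
corr = np.empty_like(iid)
corr[0] = iid[0]
for t in range(1, T):
    corr[t] = rho * corr[t - 1] + np.sqrt(1 - rho**2) * iid[t]
print("correlated       :", round(msd_exponent(corr, lags), 2))
```

The point of the coupling matrix `W` is that purely spatial mixing of i.i.d. increments cannot change the temporal scaling of the walk; only correlations along the time axis can, which is one way to read the abstract's claim that $D$ probes gradient-field geometry rather than architecture.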