🤖 AI Summary
This work investigates the optimization mechanism underlying "grokking", the phenomenon in deep learning where test loss remains stagnant for an extended period after training loss reaches zero, followed by a sudden drop. The authors analyze gradient flow under small weight decay and uncover a two-timescale dynamical structure: an initial fast phase achieving perfect interpolation of the training data, followed by a slow phase, on a timescale of order 1/λ, during which the parameter ℓ₂-norm shrinks along a Riemannian gradient flow on the manifold of interpolating solutions. This gives a purely optimization-theoretic account of grokking, attributing the generalization transition to explicit norm contraction induced by weight decay rather than to implicit regularization. Leveraging singular perturbation theory and center-manifold analysis, the authors rigorously characterize the two-scale behavior. Empirical validation on several synthetic regression tasks confirms the temporal alignment between the slow norm decay and the abrupt reduction in test error.
📝 Abstract
We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at a time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_2$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the *grokking* effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks.
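The two-phase behaviour described above can be illustrated numerically with a minimal Euler discretisation of gradient flow with weight decay on a toy overparameterised linear regression, where the manifold of global minimisers is the affine set of interpolating solutions. This is an illustrative sketch, not the paper's experimental setup; the problem sizes, step size, and value of $\lambda$ are all assumptions chosen for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparameterised regression: n equations, d > n unknowns, so the
# loss F(theta) = 0.5 * ||X theta - y||^2 has a whole affine manifold of
# global minimisers (interpolating solutions).
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

lam = 1e-2   # weight decay lambda (small relative to the curvature of F)
eta = 2e-2   # Euler step size discretising the gradient flow
theta = rng.standard_normal(d)

losses, norms = [], []
for t in range(100_000):
    if t % 1_000 == 0:
        losses.append(0.5 * float(np.sum((X @ theta - y) ** 2)))
        norms.append(float(np.linalg.norm(theta)))
    grad = X.T @ (X @ theta - y)         # gradient of the unregularised loss
    theta -= eta * (grad + lam * theta)  # gradient-flow step with weight decay

# Fast phase: the training loss collapses within a few hundred steps.
# Slow phase: ||theta|| keeps shrinking on a timescale of order 1/lambda,
# drifting along the interpolation manifold toward the minimum-norm
# interpolator pinv(X) @ y.
min_norm = float(np.linalg.norm(np.linalg.pinv(X) @ y))
print(f"loss after 1k steps: {losses[1]:.2e}, "
      f"final norm: {norms[-1]:.3f}, min-norm solution: {min_norm:.3f}")
```

The printed values show the separation of timescales: the training loss is already near zero after the first thousand steps, while the parameter norm continues to drift downward for tens of thousands of steps before settling near the minimum-norm interpolator.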