Grokking Beyond the Euclidean Norm of Model Parameters

📅 2025-06-06
🤖 AI Summary
This work investigates how regularization induces grokking, the delayed generalization that follows overfitting in neural network training. Method: Through theoretical analysis and experiments, the authors examine explicit regularizers (e.g., ℓ₁ and nuclear-norm penalties) and implicit regularization, each encoding a structural prior such as sparsity or low rank, and study their role in grokking dynamics. Contribution/Results: They show that a small but non-zero regularization toward a structural property induces grokking, and that over-parameterization through added depth makes it possible to grok or ungrok without explicit regularization. They further show that the ℓ₂ norm is not a reliable proxy for generalization when the model is regularized toward a different property. Finally, they show that grokking can be amplified through data selection alone, with all other hyperparameters fixed.

📝 Abstract
Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property $P$ (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the $\ell_2$ norm is not a reliable proxy for generalization when the model is regularized toward a different property $P$, as the $\ell_2$ norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified solely through data selection, with any other hyperparameter fixed.
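As a rough illustration of "gradient descent with a small but non-zero regularization of $P$", the sketch below runs subgradient descent on an underdetermined least-squares problem with a small $\ell_1$ penalty, taking $P$ to be sparsity. The problem setup, the function name `l1_regularized_gd`, and every hyperparameter value are assumptions chosen for illustration. They are not the paper's experimental configuration, and the sketch makes no claim to reproduce the paper's results.

```python
import numpy as np

def l1_regularized_gd(X, y, beta=1e-3, lr=1e-2, steps=2000, seed=0):
    """Subgradient descent on 0.5 * mean((X w - y)^2) + beta * ||w||_1.

    Hypothetical helper: beta is the small regularization strength,
    lr the step size. Returns the final weights and a log of
    (step, train_loss, l1_norm, l2_norm) recorded every 100 steps.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)  # dense initialization (an assumption)
    history = []
    for t in range(steps):
        residual = X @ w - y
        if t % 100 == 0:
            history.append((
                t,
                0.5 * float(np.mean(residual ** 2)),
                float(np.linalg.norm(w, 1)),
                float(np.linalg.norm(w, 2)),
            ))
        # Data-fit gradient plus a subgradient of the l1 penalty.
        grad = X.T @ residual / n + beta * np.sign(w)
        w = w - lr * grad
    return w, history

# Toy problem (assumed): fewer samples than dimensions, sparse ground truth.
rng = np.random.default_rng(0)
n, d, k = 40, 100, 5
w_star = np.zeros(d)
w_star[rng.choice(d, size=k, replace=False)] = 1.0
X = rng.normal(size=(n, d))
y = X @ w_star

w, history = l1_regularized_gd(X, y)
recovery_error = float(np.linalg.norm(w - w_star))
```

Logging the training loss next to `recovery_error` and both norms over a much longer run is one way to look for the delayed-generalization pattern the abstract describes; whether it appears depends on `beta`, `lr`, and the initialization scale.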
Problem

Research questions and friction points this paper is trying to address.

Investigates grokking induced by regularization in neural networks
Explores over-parameterization's role in grokking without explicit regularization
Examines ℓ₂ norm reliability as a generalization proxy
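To make the last point concrete, the sketch below compares the ℓ₂ (Frobenius), ℓ₁, and nuclear norms of two small matrices. It is a toy illustration that is not taken from the paper, and the helper name `weight_norms` is hypothetical. The all-ones matrix has the larger ℓ₂ norm yet is rank one, while the identity has the smaller ℓ₂ norm yet is full rank, so the ℓ₂ norm alone says little about a property such as low rank.

```python
import numpy as np

def weight_norms(W):
    """Return the l2 (Frobenius), l1, and nuclear norms of a 2-D array."""
    W = np.asarray(W, dtype=float)
    return {
        "l2": float(np.linalg.norm(W)),                  # Frobenius norm
        "l1": float(np.abs(W).sum()),                    # entrywise l1 norm
        "nuclear": float(np.linalg.norm(W, ord="nuc")),  # sum of singular values
    }

full_rank = np.eye(4)        # rank 4
low_rank = np.ones((4, 4))   # rank 1

norms_full = weight_norms(full_rank)
norms_low = weight_norms(low_rank)
rank_full = int(np.linalg.matrix_rank(full_rank))
rank_low = int(np.linalg.matrix_rank(low_rank))
```

Here both matrices share the same nuclear norm while their ranks and ℓ₂ norms differ, which is why a penalty or diagnostic should be matched to the property of interest.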
Innovation

Methods, ideas, or system contributions that make the work stand out.

Small but non-zero regularization toward a property P (e.g., sparsity or low rank) induces grokking
Over-parameterization through added depth enables grokking or ungrokking without explicit regularization
Data selection amplifies grokking with fixed hyperparameters
Pascal Jr Tikeng Notsawo
DIRO, Université de Montréal, Montreal, Quebec, Canada; Mila, Quebec AI Institute, Montreal, Quebec, Canada; CHU Sainte-Justine Research Center, Montreal, Quebec, Canada
Guillaume Dumas
Associate Professor, CHUSJ/Mila, University of Montreal
Hyperscanning, Neurodynamics, Precision Psychiatry, Social AI, SciML
Guillaume Rabusseau
Assistant Professor - Canada CIFAR AI Chair, Université de Montréal / Mila
Machine Learning, Tensors, Weighted Automata, Tensor Networks