To Grok Grokking: Provable Grokking in Ridge Regression

📅 2026-01-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the “grokking” phenomenon in over-parameterized linear models trained via gradient descent, in which models suddenly transition from prolonged overfitting to near-perfect generalization. Within a ridge regression setting with weight decay, the authors identify three distinct phases: early overfitting of the training data, an extended period of poor generalization, and eventual convergence of the generalization error to zero. Their theoretical analysis establishes, for the first time, a precise quantitative relationship between the onset time of grokking and the training hyperparameters, demonstrating that grokking can be amplified or eliminated through principled hyperparameter tuning. Experiments further confirm that the same mechanism persists in nonlinear neural networks, indicating that grokking arises from specific training conditions rather than from the model architecture or learning algorithm.
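
As a sketch of the two-timescale mechanism the summary alludes to, here is the standard derivation for gradient descent with weight decay on a ridge objective, in our own notation; the paper's exact setup, assumptions, and constants may differ.

```latex
% Training loss L(w) = (1/2n)||Xw - y||^2, weight decay \lambda, step size \eta.
\[
  w_{t+1}
  = w_t - \eta\left(\tfrac{1}{n} X^\top (X w_t - y) + \lambda w_t\right)
  = \left(I - \eta\left(\tfrac{1}{n} X^\top X + \lambda I\right)\right) w_t
    + \tfrac{\eta}{n} X^\top y .
\]
% With v_i, s_i the eigenvectors/eigenvalues of (1/n) X^\top X and
% w_\lambda = ((1/n) X^\top X + \lambda I)^{-1} (1/n) X^\top y the ridge fixed point:
\[
  w_t - w_\lambda
  = \sum_i \left(1 - \eta (s_i + \lambda)\right)^{t}
    \langle w_0 - w_\lambda,\, v_i \rangle\, v_i .
\]
% Directions the data sees (s_i > 0) contract quickly, fitting the training set;
% null-space directions (s_i = 0) shrink only at the weight-decay rate
% (1 - \eta\lambda)^t, so generalization can lag by roughly 1/(\eta\lambda) steps.
```

On this reading, the delay scales inversely with ηλ and logarithmically with the initialization scale, which is one way hyperparameters could control or eliminate the grokking delay; this is our heuristic gloss, not the paper's theorem statement.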

📝 Abstract
We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
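
To make the three stages concrete, below is a minimal, self-contained simulation in the spirit of the abstract's setting. The sizes, initialization scale, and hyperparameters are illustrative choices of ours, not the paper's: with a large initialization, the component of the weights outside the data's row space is invisible to the training loss and is removed only by weight decay, which separates the timescales of fitting and generalizing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy over-parameterized noiseless linear regression (sizes illustrative, not the paper's).
n, d = 100, 120                          # n samples, d > n parameters
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)         # unit-norm teacher
y = X @ w_star

X_te = rng.standard_normal((5000, d))    # held-out set from the same distribution
y_te = X_te @ w_star

# Large initialization: the part of w outside the row space of X shrinks only
# through weight decay, at rate (1 - eta * lam) per step.
alpha = 10.0                             # initialization scale (illustrative)
w = rng.standard_normal(d) * (alpha / np.sqrt(d))

eta, lam = 0.3, 5e-5                     # step size and weight-decay strength (illustrative)
log_steps = {0, 1_000, 3_000, 10_000, 30_000, 100_000, 300_000}

for t in range(300_001):
    if t in log_steps:
        train = np.mean((X @ w - y) ** 2)
        test = np.mean((X_te @ w - y_te) ** 2)
        print(f"step {t:7d}   train {train:.2e}   test {test:.2e}")
    grad = X.T @ (X @ w - y) / n         # gradient of the mean squared training loss
    w -= eta * (grad + lam * w)          # gradient descent step with weight decay
```

On a typical run, the training error collapses within a few thousand steps (stage i), the test error plateaus near alpha^2 (d - n) / d for tens of thousands of steps (stage ii), and then decays at the weight-decay rate toward the ridge solution's risk (stage iii). Increasing eta * lam shortens the delay and shrinking alpha removes it, consistent with the abstract's claim that grokking can be amplified or eliminated by hyperparameter tuning. Note that in this isotropic toy the late-time test error approaches the small min-norm/ridge risk rather than exactly zero; the regime analyzed in the paper is what drives it arbitrarily small.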
Problem

Research questions and friction points this paper is trying to address.

grokking
generalization delay
overfitting
ridge regression
gradient descent
Innovation

Methods, ideas, or system contributions that make the work stand out.

grokking
ridge regression
over-parameterization
generalization delay
weight decay