🤖 AI Summary
This work addresses two key challenges: the absence of grokking (i.e., delayed generalization) in low-data regimes, where the number of training samples falls below a critical threshold, and generalization failure under distribution shift. To tackle these, we propose a knowledge distillation (KD)-based method for transferring insight across distributions. Our core contribution is the finding that a pre-grokked teacher model can distill its "grokking capability" into a student model, enabling generalization to emerge rapidly on a novel distribution even with as little as 10% of the training data. We further integrate this approach into a continual pretraining framework that jointly mitigates data scarcity, overfitting, and catastrophic forgetting during sequential task learning and joint distribution modeling. Experiments demonstrate that our method significantly accelerates grokking, stabilizes generalization, and improves robustness under distribution shift. This establishes a novel paradigm for efficient learning in few-shot and dynamically shifting distribution settings.
📝 Abstract
In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes, where the number of training samples falls below the critical threshold and grokking is unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a source distribution p1 can induce and accelerate grokking on a different distribution p2, even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the individual distributions enables generalization. Finally, we examine a continual pretraining setup, where a grokked model transitions from p1 to p2, and find that KD both accelerates generalization and mitigates catastrophic forgetting, achieving strong performance even with only 10% of the data. Together, our results provide new insights into the mechanics of grokking under knowledge transfer and underscore the central role of KD in enabling generalization in low-data and evolving-distribution settings.
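The abstract does not spell out the distillation objective, so as context: the standard KD recipe (which a method like this would plausibly build on) trains the student on a blend of the hard-label cross-entropy and a temperature-softened KL term toward the teacher's logits. The function below is a minimal NumPy sketch of that generic loss, not the paper's actual implementation; the names `distillation_loss`, `T`, and `alpha` are illustrative assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T yields softer probabilities."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Generic KD objective (illustrative, not the paper's exact loss):
    alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, hard_label).
    The T^2 factor keeps soft-target gradients on the same scale as the hard loss.
    """
    p_t = softmax(teacher_logits, T)  # softened teacher targets
    p_s = softmax(student_logits, T)  # softened student predictions
    kd = float(np.sum(p_t * (np.log(p_t) - np.log(p_s)))) * T * T
    ce = float(-np.log(softmax(student_logits)[hard_label]))
    return alpha * kd + (1 - alpha) * ce
```

In the paper's setting, the teacher would be a model that has already grokked on p1, and the student is trained on (limited) data from p2; `alpha` trades off imitation of the teacher against fitting the new labels.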