🤖 AI Summary
This work investigates fundamental differences between grokking (characterized by sudden, delayed generalization) and standard training, focusing on learning dynamics and representational properties. Methodologically, we employ feature analysis, compressibility assessment, information-geometric modeling, and tracking of training trajectories. We find that both paradigms learn the same task-relevant features but differ markedly in encoding efficiency: standard training exhibits a novel "compressive regime" with a linear trade-off between loss and compressibility, whereas grokking evolves along a straight path in information space, a property we formalize with novel information-geometric measures. Experiments show that standard-trained models achieve compression factors of up to 25× the base model, five times greater than grokked models, and that grokking attains peak compressibility immediately after the grokking plateau. Our core contributions are threefold: (i) identification of the compressive regime and its loss-compressibility trade-off in standard training; (ii) an information-geometric characterization of grokking as a straight path in information space; and (iii) empirical evidence that model development in grokking is task-dependent, with peak compressibility reached immediately post-generalization.
📝 Abstract
Grokking typically achieves similar loss to ordinary, "steady", learning. We ask whether these different learning paths, grokking versus ordinary training, lead to fundamental differences in the learned models. To do so, we compare the features, compressibility, and learning dynamics of models trained via each path on two tasks. We find that grokked and steadily trained models learn the same features, but there can be large differences in the efficiency with which these features are encoded. In particular, we find a novel "compressive regime" of steady training, in which a linear trade-off emerges between model loss and compressibility, and which is absent in grokking. In this regime, we can achieve compression factors of 25× the base model, and 5× the compression achieved in grokking. We then track how model features and compressibility develop through training. We show that model development in grokking is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau. Finally, we introduce novel information-geometric measures which demonstrate that models undergoing grokking follow a straight path in information space.