On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the impact of preconditioned gradient descent (PGD), such as Gauss–Newton methods, on spectral bias (the tendency of neural networks to learn low-frequency components first) and grokking (delayed generalization) in neural network training. Combining Neural Tangent Kernel (NTK) theory with empirical analysis, the study shows that preconditioning promotes more uniform exploration of parameter space, mitigating spectral bias and substantially shortening the delay before grokking occurs. The findings further show how preconditioned optimization smooths the transition of the learning dynamics from the lazy NTK regime, where features remain nearly fixed, to a feature-rich regime characterized by active representation learning, offering new insight into the interplay between optimization dynamics and phase transitions in generalization behavior.
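
To make the preconditioning idea concrete, here is a minimal sketch (not the paper's code) of a damped Gauss-Newton update on a toy least-squares problem. The model is linear in its parameters so the Jacobian, and hence the NTK, is exact; the feature scaling, damping `eps`, step size, and step count are illustrative assumptions.

```python
import numpy as np

# Random-feature regression f(x; w) = Phi(x) @ w. Features are scaled by
# 1/k^2 to mimic a kernel whose eigenvalues decay with frequency, so plain
# gradient descent exhibits spectral bias on the high-frequency component.
X = np.linspace(-1.0, 1.0, 64)[:, None]
Phi = np.concatenate([np.sin(k * np.pi * X) / k**2 for k in range(1, 9)], axis=1)
y = np.sin(np.pi * X[:, 0]) + 0.3 * np.sin(7.0 * np.pi * X[:, 0])  # low + high freq

def train(preconditioned, steps=200, lr=0.05, eps=1e-3):
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        r = Phi @ w - y                # residual
        g = Phi.T @ r                  # gradient of 0.5 * ||r||^2
        if preconditioned:
            # Damped Gauss-Newton: solve (J^T J + eps I) d = g, unit step.
            H = Phi.T @ Phi + eps * np.eye(Phi.shape[1])
            w -= np.linalg.solve(H, g)
        else:
            w -= lr * g                # plain gradient descent
    return 0.5 * np.mean((Phi @ w - y) ** 2)

print("plain GD loss:        ", train(preconditioned=False))
print("Gauss-Newton PGD loss:", train(preconditioned=True))
```

Running the sketch, plain gradient descent is left with a visible residual on the high-frequency mode after 200 steps, while the damped Gauss-Newton step drives all modes down at nearly the same rate, which is the uniform-exploration behavior the summary describes.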

📝 Abstract
Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances generalization by suppressing high-frequency noise, it becomes a limitation in scientific tasks that require capturing fine-scale structure. The delayed-generalization phenomenon known as grokking is another barrier to rapid training of neural networks; it has been hypothesized to arise as learning transitions from the NTK regime to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD), such as Gauss-Newton, on spectral bias and grokking. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Building on the rich-regime grokking hypothesis, we further study how PGD can reduce the delays associated with grokking. Our conjecture is that PGD, freed from the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking is a transitional behavior between the lazy regime characterized by the NTK and the rich regime. These findings deepen our understanding of the interplay between optimization dynamics, spectral bias, and the phases of neural network learning.
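
The mechanism behind the abstract's claim follows the standard lazy-regime mode decomposition; the sketch below is that textbook NTK calculation, with the damping $\epsilon$ an assumed hyperparameter rather than a value from the paper. Under gradient flow the training residual contracts independently along NTK eigendirections, so small-eigenvalue (typically high-frequency) modes stall, whereas a damped Gauss-Newton preconditioner makes the per-mode rates nearly uniform.

```latex
% Lazy-regime dynamics of the residual u(t) = f_{\theta(t)}(X) - y,
% with Jacobian J and NTK Gram matrix \Theta = J J^\top:
\begin{align*}
  \dot{u}(t) &= -\Theta\, u(t), &
  \Theta &= \textstyle\sum_i \lambda_i v_i v_i^\top, &
  v_i^\top u(t) &= e^{-\lambda_i t}\, v_i^\top u(0).
\end{align*}
% Damped Gauss-Newton flow \dot{\theta} = -(J^\top J + \epsilon I)^{-1} J^\top u
% replaces the per-mode rate \lambda_i by \lambda_i / (\lambda_i + \epsilon) \approx 1:
\begin{align*}
  \dot{u}(t) = -J (J^\top J + \epsilon I)^{-1} J^\top u(t)
  \quad\Longrightarrow\quad
  v_i^\top u(t) = e^{-\frac{\lambda_i}{\lambda_i + \epsilon}\, t}\, v_i^\top u(0).
\end{align*}
```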
Problem

Research questions and friction points this paper is trying to address.

spectral bias
grokking
neural network training
generalization delay
fine-scale structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preconditioned Gradient Descent
Spectral Bias
Grokking
Rich Learning Regime
Neural Tangent Kernel