🤖 AI Summary
This study investigates how label noise influences the generalization of over-parameterized two-layer linear networks trained with stochastic gradient descent (SGD), revealing its role in driving a transition from the lazy to the rich learning regime. Through dynamical systems analysis and experiments on both synthetic and real-world datasets, the authors identify two-phase training dynamics under noisy SGD: an initial phase in which the model escapes the lazy regime (Phase I), followed by a phase in which the weights progressively align with the ground-truth solution (Phase II). This work is the first to explicitly characterize the mechanism by which label noise triggers this shift in learning behavior, and it extends the insight to other optimizers such as Sharpness-Aware Minimization. Both theoretical and empirical results consistently demonstrate that label noise can enhance generalization, offering a novel perspective on the implicit bias of SGD.
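The setup described above can be illustrated with a minimal sketch: label-noise SGD on a two-layer linear network fit to synthetic linear data, where fresh Gaussian noise is added to the labels at every step. All dimensions, the learning rate, the noise level, and the initialization scale below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression: labels are exactly linear in the inputs
# (hyperparameters are illustrative, not the paper's).
d, n, h = 10, 200, 50            # input dim, samples, hidden width
X = rng.normal(size=(n, d)) / np.sqrt(d)
w_star = rng.normal(size=d)      # ground-truth interpolator
y = X @ w_star

# Two-layer linear network f(x) = x @ W1 @ W2
W1 = 0.1 * rng.normal(size=(d, h))
W2 = 0.1 * rng.normal(size=h)

lr, sigma, batch = 0.1, 0.5, 16
for step in range(3000):
    idx = rng.integers(0, n, size=batch)
    xb, yb = X[idx], y[idx]
    # Label-noise SGD: resample Gaussian label noise every step
    yb_noisy = yb + sigma * rng.normal(size=batch)
    err = xb @ W1 @ W2 - yb_noisy
    # Gradients of the mean-squared error w.r.t. each layer
    g1 = np.outer(xb.T @ err / batch, W2)
    g2 = W1.T @ (xb.T @ err) / batch
    W1 -= lr * g1
    W2 -= lr * g2

# Effective linear predictor and its alignment with the ground truth
w_eff = W1 @ W2
alignment = w_eff @ w_star / (np.linalg.norm(w_eff) * np.linalg.norm(w_star))
clean_mse = float(np.mean((X @ W1 @ W2 - y) ** 2))
```

Because the label noise is resampled at every step, its expected gradient contribution vanishes, and the effective predictor `w_eff` drifts toward the ground-truth interpolator `w_star` (high `alignment`, low `clean_mse`), consistent with the Phase II alignment behavior the summary describes.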
📝 Abstract
One crucial factor behind the success of deep learning lies in the implicit bias induced by noise inherent in gradient-based training algorithms. Motivated by empirical observations that training with noisy labels improves model generalization, we delve into the underlying mechanisms behind stochastic gradient descent (SGD) with label noise. Focusing on a two-layer over-parameterized linear network, we analyze the learning dynamics of label noise SGD, unveiling a two-phase learning behavior. In \emph{Phase I}, the magnitudes of the model weights progressively diminish, and the model escapes the lazy regime and enters the rich regime. In \emph{Phase II}, the alignment between the model weights and the ground-truth interpolator increases, and the model eventually converges. Our analysis highlights the critical role of label noise in driving the transition from the lazy to the rich regime and offers a minimal explanation of its empirical success. Furthermore, we extend these insights to Sharpness-Aware Minimization (SAM), showing that the principles governing label noise SGD also apply to broader optimization algorithms. Extensive experiments, conducted under both synthetic and real-world setups, strongly support our theory. Our code is released at https://github.com/a-usually/Label-Noise-SGD.