🤖 AI Summary
Adaptive optimizers such as Adam exhibit distinct generalization behavior compared to SGD, yet the underlying implicit regularization mechanism remains poorly understood.
Method: We model Adam, RMSProp, and Shampoo via continuous-time approximations and stochastic differential equations, revealing that Adam implicitly minimizes a distinct sharpness measure—the trace of the square root of the diagonal of the Hessian—rather than the trace of the full Hessian minimized by SGD. This geometric distinction arises from the adaptive preconditioning inherent in these methods.
Contribution/Results: We prove that, in sparse linear regression with label noise, this implicit bias leads Adam to sparser solutions that generalize better than those found by SGD; experiments confirm the enhanced sparsity and robustness of Adam's solutions. Our analysis provides a unified theoretical framework for adaptive optimization dynamics and establishes a rigorous link between diagonal-Hessian-based sharpness and implicit regularization. This work offers a new perspective on the implicit bias of adaptive methods and lays a foundation for designing sharpness-aware optimizers grounded in curvature geometry.
📝 Abstract
Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\mathrm{tr}(\mathbf{H})$, whereas we prove that Adam minimizes $\mathrm{tr}(\mathrm{Diag}(\mathbf{H})^{1/2})$ instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.
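To make the contrast between the two sharpness measures concrete, here is a minimal numeric sketch. The toy Hessians below are illustrative values chosen for this example, not taken from the paper's experiments; the point is only that $\mathrm{tr}(\mathbf{H})$ can be identical for two minimizers while $\mathrm{tr}(\mathrm{Diag}(\mathbf{H})^{1/2})$ prefers the one whose curvature is concentrated in few coordinates.

```python
import numpy as np

# Toy positive semi-definite Hessians at two hypothetical minimizers
# (illustrative values only, not from the paper's experiments).
H_spread = np.diag([1.0, 1.0, 1.0, 1.0])  # curvature spread across coordinates
H_concen = np.diag([4.0, 0.0, 0.0, 0.0])  # same trace, concentrated curvature

def sgd_sharpness(H):
    """tr(H): the sharpness measure SGD is shown to minimize."""
    return float(np.trace(H))

def adam_sharpness(H):
    """tr(Diag(H)^{1/2}): the measure the paper attributes to Adam."""
    return float(np.sum(np.sqrt(np.diag(H))))

# SGD's measure cannot distinguish the two minimizers ...
print(sgd_sharpness(H_spread), sgd_sharpness(H_concen))    # 4.0 4.0
# ... while Adam's measure favors the concentrated one.
print(adam_sharpness(H_spread), adam_sharpness(H_concen))  # 4.0 2.0
```

Since concentrated curvature goes hand in hand with sparse solutions in the diagonal linear network setting, this gives an intuition for why the paper's measure biases Adam toward sparsity.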