🤖 AI Summary
Although Adam often converges faster than SGD in practice, existing theory struggles to explain this gap. This work addresses the issue by uncovering the critical role of second-moment normalization in Adam and, through a novel combination of stopping-time arguments and martingale analysis under the standard bounded variance assumption, establishes the first high-probability convergence separation between the two optimizers. Specifically, the dependence of Adam's convergence bound on the confidence parameter δ scales as δ^{-1/2}, whereas any comparable guarantee for SGD must incur at least a δ^{-1} dependence. This result not only rigorously distinguishes the convergence behaviors of Adam and SGD but also provides the first high-probability theoretical explanation of Adam's empirical advantage.
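For readers who want the update rule in front of them, here is a minimal NumPy sketch of standard Adam (Kingma & Ba, 2015), with the division by the square root of the second-moment estimate being the normalization the analysis centers on. The hyperparameters are the usual defaults, and the noisy quadratic objective is an illustrative stand-in for a bounded-variance gradient oracle, not the paper's setting.

```python
import numpy as np

def adam_step(x, m, v, grad, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on parameter vector x.

    The division by sqrt(v_hat) is the second-moment normalization
    that the analysis identifies as the source of Adam's improved
    delta^{-1/2} high-probability dependence.
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate, per coordinate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # normalized step
    return x, m, v

# Toy usage: minimize f(x) = ||x||^2 / 2 from noisy gradients.
# The Gaussian noise is a stand-in for a bounded-variance oracle.
rng = np.random.default_rng(0)
x = np.ones(5)
m, v = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 1001):
    grad = x + 0.1 * rng.standard_normal(x.shape)  # stochastic gradient of f
    x, m, v = adam_step(x, m, v, grad, t)
print(np.linalg.norm(x))  # should be small (near the noise floor)
```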
📝 Abstract
Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover the key role of second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded-variance (second-moment) assumption. In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas any corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.
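Read schematically, the separation can be written as follows. Only the $\delta$-exponents reflect the paper's claim; the stationarity measure $\mathcal{E}_T$ (e.g., the best gradient norm over $T$ iterations) and the factors $C_{\mathrm{Adam}}(T)$, $C_{\mathrm{SGD}}(T)$ are illustrative placeholders collecting all problem- and horizon-dependent quantities.

```latex
% Schematic statement of the separation. Only the \delta-exponents
% reflect the paper's claim; \mathcal{E}_T (a stationarity measure,
% e.g. the best gradient norm over T steps) and the C(T) factors
% are placeholders, not quantities taken from the paper.
\[
\text{Adam:}\;\;
  \mathbb{P}\big[\,\mathcal{E}_T \le C_{\mathrm{Adam}}(T)\,\delta^{-1/2}\,\big] \ge 1-\delta,
\qquad
\text{SGD:}\;\;
  \text{any guarantee of this form requires at least }
  C_{\mathrm{SGD}}(T)\,\delta^{-1}.
\]
```

The practical reading: for a small failure probability such as $\delta = 10^{-4}$, the confidence term in Adam's bound grows by a factor of $10^2$, while SGD's grows by $10^4$.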