🤖 AI Summary
Although Adam often converges faster than SGD in practice, existing theory struggles to explain this gap. This work addresses the issue by uncovering the critical role of second-moment normalization in Adam and, through a novel combination of stopping-time arguments and martingale analysis under the standard bounded variance assumption, establishes the first high-probability convergence separation between the two optimizers. Specifically, the dependence of Adam's convergence bound on the confidence parameter δ scales as δ^{-1/2}, whereas any comparable guarantee for SGD must incur at least a δ^{-1} dependence. This result not only rigorously distinguishes the convergence behaviors of Adam and SGD but also provides the first high-probability theoretical explanation of Adam's empirical advantage.
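For readers who want the update rule in front of them, here is a minimal NumPy sketch of standard Adam (Kingma & Ba, 2015), with the division by the square root of the second-moment estimate being the normalization the analysis centers on. The hyperparameters are the usual defaults, and the noisy quadratic objective is an illustrative stand-in for a bounded-variance gradient oracle, not the paper's setting.

```python
import numpy as np

def adam_step(x, m, v, grad, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on parameter vector x.

    The division by sqrt(v_hat) is the second-moment normalization
    that the analysis identifies as the source of Adam's improved
    delta^{-1/2} high-probability dependence.
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate, per coordinate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # normalized step
    return x, m, v

# Toy usage: minimize f(x) = ||x||^2 / 2 from noisy gradients.
# The Gaussian noise is a stand-in for a bounded-variance oracle.
rng = np.random.default_rng(0)
x = np.ones(5)
m, v = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 1001):
    grad = x + 0.1 * rng.standard_normal(x.shape)  # stochastic gradient of f
    x, m, v = adam_step(x, m, v, grad, t)
print(np.linalg.norm(x))  # should be small (near the noise floor)
```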
📝 Abstract
Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover the key role of second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded-variance (second-moment) assumption. In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas any corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.
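Read schematically, the separation can be written as follows. Only the $\delta$-exponents reflect the paper's claim; the stationarity measure $\mathcal{E}_T$ (e.g., the best gradient norm over $T$ iterations) and the factors $C_{\mathrm{Adam}}(T)$, $C_{\mathrm{SGD}}(T)$ are illustrative placeholders collecting all problem- and horizon-dependent quantities.

```latex
% Schematic statement of the separation. Only the \delta-exponents
% reflect the paper's claim; \mathcal{E}_T (a stationarity measure,
% e.g. the best gradient norm over T steps) and the C(T) factors
% are placeholders, not quantities taken from the paper.
\[
\text{Adam:}\;\;
  \mathbb{P}\big[\,\mathcal{E}_T \le C_{\mathrm{Adam}}(T)\,\delta^{-1/2}\,\big] \ge 1-\delta,
\qquad
\text{SGD:}\;\;
  \text{any guarantee of this form requires at least }
  C_{\mathrm{SGD}}(T)\,\delta^{-1}.
\]
```

The practical reading: for a small failure probability such as $\delta = 10^{-4}$, the confidence term in Adam's bound grows by a factor of $10^2$, while SGD's grows by $10^4$.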