On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions

📅 2024-02-06
🏛️ Neural Information Processing Systems
📈 Citations: 5
Influential: 0
🤖 AI Summary
This paper addresses the weak theoretical foundation of Adam for non-convex smooth optimization. Under relaxed assumptions, where gradients may be unbounded and stochastic noise follows an affine-variance model (encompassing bounded, sub-Gaussian, and other common noise types), the authors establish a unified noise framework and provide a high-probability convergence analysis. Methodologically, they integrate stochastic optimization theory, probabilistic inequalities, and non-convex analysis, avoiding strong assumptions such as gradient boundedness or fixed smoothness constants, and introduce a generalized smoothness notion that accommodates variable (even unbounded) Lipschitz constants. The theoretical contributions are: (1) proving that vanilla Adam, without hyperparameter tuning, converges to a stationary point at rate $O(\mathrm{poly}(\log T)/\sqrt{T})$, which is optimal up to logarithmic factors; (2) delivering the first high-probability convergence guarantee for Adam that applies to unbounded gradients and generalized smooth objectives, with adaptivity strictly superior to SGD.
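For reference, the affine-variance noise model mentioned above is commonly written as the condition below (the constants $\sigma_0, \sigma_1$ are illustrative notation, not necessarily the paper's; the paper's unified noise model subsumes this form):

```latex
% Affine-variance condition on the stochastic gradient g_t at iterate x_t:
\mathbb{E}\!\left[\,\|g_t - \nabla f(x_t)\|^2 \,\middle|\, x_t\right]
  \;\le\; \sigma_0^2 + \sigma_1^2 \,\|\nabla f(x_t)\|^2 .
% Setting \sigma_1 = 0 recovers the classical bounded-variance case;
% bounded and sub-Gaussian noise likewise fall under this umbrella.
```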

📝 Abstract
The Adaptive Momentum Estimation (Adam) algorithm is highly effective in training various deep learning tasks. Despite this, there is limited theoretical understanding of Adam, especially when focusing on its vanilla form in non-convex smooth scenarios with potentially unbounded gradients and affine-variance noise. In this paper, we study vanilla Adam under these challenging conditions. We introduce a comprehensive noise model which governs affine-variance noise, bounded noise, and sub-Gaussian noise. We show that Adam can find a stationary point at a $\mathcal{O}(\mathrm{poly}(\log T)/\sqrt{T})$ rate with high probability under this general noise model, where $T$ denotes the total number of iterations, matching the lower bound for stochastic first-order algorithms up to logarithmic factors. More importantly, we reveal that Adam is free of tuning step sizes with respect to any problem parameters, yielding a better adaptation property than Stochastic Gradient Descent under the same conditions. We also provide a probabilistic convergence result for Adam under a generalized smoothness condition, which allows unbounded smoothness parameters and has been illustrated empirically to more accurately capture the smoothness of many practical objective functions.
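For readers unfamiliar with the algorithm under analysis, the vanilla Adam update can be sketched as follows. The hyperparameters are the common defaults from Kingma and Ba, and the diminishing step size $\eta_t = \eta/\sqrt{t}$ is an illustrative choice, not necessarily the schedule analyzed in this paper:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One vanilla Adam update on a list-of-floats parameter vector.

    beta1/beta2/eps are the usual defaults; lr / sqrt(t) is an
    illustrative diminishing step size, not the paper's prescription.
    """
    # Exponential moving averages of the gradient and its square
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias-corrected estimates
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    step = lr / math.sqrt(t)  # eta_t = eta / sqrt(t)
    theta = [th - step * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Toy run: minimize f(x) = x^2 (gradient 2x), starting from x = 1.
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    grad = [2.0 * theta[0]]
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta[0])  # drifts toward the stationary point 0
```

Note that no problem-dependent tuning of `lr` is needed for the iterates to approach the stationary point, which is the adaptation property the abstract highlights.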
Problem

Research questions and friction points this paper is trying to address.

Analyzing Adam's convergence under unbounded gradients and noise
Establishing convergence rate for non-convex optimization with relaxed assumptions
Demonstrating parameter-free step-size adaptation in stochastic optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adam algorithm under relaxed noise assumptions
Convergence with poly-logarithmic rate without step-size tuning
Generalized smooth condition allowing unbounded parameters
Yusu Hong
Center for Data Science, Zhejiang University, Hangzhou, P.R. China
Junhong Lin
Center for Data Science, Zhejiang University, Hangzhou, P.R. China