🤖 AI Summary
This work establishes, for the first time, a rigorous convergence theory for the Adam optimizer in strongly convex stochastic optimization. Using tools from stochastic optimization, convex analysis, and probability theory, we conduct an asymptotic analysis of Adam's iterative dynamics, precisely characterizing its convergence rate: $O(1/T)$ in the number of gradient steps, $O(1/m)$ in the mini-batch size, and $O(1/(1-\beta_2))$ in the second-moment decay parameter $\beta_2$. Our central contribution is the "Adam symmetry theorem", which identifies symmetry of the data distribution as a necessary and sufficient condition for Adam to converge to the global optimum; under data asymmetry, Adam exhibits a systematic bias and fails to converge to the optimal solution. These theoretical findings are empirically validated: numerical experiments faithfully reproduce the divergence under asymmetric data, exposing fundamental convergence limitations of Adam in practical deep learning tasks.
📝 Abstract
Besides the standard stochastic gradient descent (SGD) method, the Adam optimizer due to Kingma & Ba (2014) is currently probably the best-known optimization method for the training of deep neural networks in artificial intelligence (AI) systems. Despite the popularity and success of Adam, it remains an *open research problem* to provide a rigorous convergence analysis for Adam, even for the class of strongly convex stochastic optimization problems (SOPs). In one of the main results of this work we establish convergence rates for Adam in terms of the number of gradient steps (convergence rate 1/2 with respect to the size of the learning rate), the size of the mini-batches (convergence rate 1 with respect to the size of the mini-batches), and the size of the second-moment parameter of Adam (convergence rate 1 with respect to the distance of the second-moment parameter to 1) for the class of strongly convex SOPs. In a further main result of this work, which we refer to as the *Adam symmetry theorem*, we illustrate the optimality of the established convergence rates by proving, for a special class of simple quadratic strongly convex SOPs, that Adam converges to the solution of the SOP (the unique minimizer of the strongly convex objective function) as the number of gradient steps increases to infinity if and *only* if the random variables in the SOP (the data of the SOP) are *symmetrically distributed*. In particular, in the standard case where the random variables in the SOP are not symmetrically distributed, we *disprove* that Adam converges to the minimizer of the SOP as the number of Adam steps increases to infinity. We also complement the conclusions of our convergence analysis and of the Adam symmetry theorem with several numerical simulations that indicate the sharpness of the established convergence rates and illustrate the practical appearance of the phenomena revealed in the *Adam symmetry theorem*.