Adam Converges Without Any Modification On Update Rules

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the long-standing controversy over whether the Adam optimizer converges. Through theoretical analysis, the authors identify, for the first time to their knowledge, a phase transition in the (β₁, β₂) plane that separates divergence from convergence, with a critical boundary that depends on the problem and, in particular, on the batch size. They prove that Adam converges when β₂ is sufficiently large and β₁ < √β₂, and they point out a region of (β₁, β₂) combinations with small β₂ where Adam can diverge to infinity. The result yields a practical tuning guideline: when Adam performs poorly, tune β₂ up past the problem-dependent threshold β₂*, then try β₁ < √β₂. Several empirical studies report improved LLM training performance when following this guideline.

📝 Abstract

Adam is the default algorithm for training neural networks, including large language models (LLMs). However, Reddi et al. (2019) provided an example in which Adam diverges, raising concerns about its deployment in AI model training. We identify a key mismatch between the divergence example and practice: Reddi et al. (2019) pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$, while practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$. In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}$. Second, when $\beta_2$ is small, we point out a region of $(\beta_1, \beta_2)$ combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to convergence as the $(\beta_1, \beta_2)$ combination changes. To our knowledge, this is the first phase transition in the $(\beta_1, \beta_2)$ plane reported in the literature, and it provides rigorous theoretical guarantees for the Adam optimizer. We further point out that the critical boundary $(\beta_1^*, \beta_2^*)$ is problem-dependent and, in particular, depends on batch size. This suggests how to tune $\beta_1$ and $\beta_2$: when Adam does not work well, we suggest tuning up $\beta_2$ inversely with batch size to surpass the threshold $\beta_2^*$, and then trying $\beta_1 < \sqrt{\beta_2}$. Our suggestions are supported by reports from several empirical studies, which observe improved LLM training performance when applying them.
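
To make the roles of $\beta_1$ and $\beta_2$ concrete, here is a minimal sketch of one Adam step together with a check of the $\beta_1 < \sqrt{\beta_2}$ condition from the abstract. It assumes the standard bias-corrected Adam update of Kingma and Ba, which may differ in details from the variant the paper analyzes. The function names `beta1_below_sqrt_beta2` and `adam_step` and the default values are illustrative, not taken from the paper. The check covers only the $\beta_1 < \sqrt{\beta_2}$ part of the condition: the threshold $\beta_2^*$ is problem-dependent and is not computed here.

```python
import math
import numpy as np


def beta1_below_sqrt_beta2(beta1: float, beta2: float) -> bool:
    """Return True if 0 <= beta1 < sqrt(beta2) < 1.

    Assumes 0 <= beta2. This is only one part of the condition in the
    abstract; it says nothing about whether beta2 exceeds the
    problem-dependent threshold beta2*.
    """
    return 0.0 <= beta1 < math.sqrt(beta2) < 1.0


def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step with bias correction (illustrative sketch).

    x, grad, m, v are floats or NumPy arrays of the same shape;
    t is the 1-based step count.
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Parameter update scaled by the root of the second-moment estimate.
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v


if __name__ == "__main__":
    print(beta1_below_sqrt_beta2(0.9, 0.999))  # common defaults
    print(beta1_below_sqrt_beta2(0.9, 0.5))    # small beta2, large beta1
    x, m, v = adam_step(x=1.0, grad=2.0, m=0.0, v=0.0, t=1)
    print(x, m, v)
```

With the common defaults $(0.9, 0.999)$ the check passes, while a pair such as $(0.9, 0.5)$ fails it because $\sqrt{0.5} \approx 0.707 < 0.9$.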

Problem

Research questions and friction points this paper is trying to address.

Adam optimizer

convergence

divergence

hyperparameter tuning

phase transition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adam optimizer

convergence proof

phase transition