🤖 AI Summary
This work explains a long-observed empirical phenomenon: the Adam optimizer performs better when its first- and second-moment decay rates are equal (β₁ = β₂). Introducing the notion of gradient scale invariance, the study proves that Adam is first-order gradient scale invariant exactly when this condition holds, aligning its update rule with scale-robust design principles. Through a combination of optimization-theoretic analysis, formal modeling, and cross-architecture experiments, the research shows that setting β₁ = β₂ makes the update markedly less sensitive to gradient rescaling, improving training stability and validation performance. These benefits are validated consistently across diverse vision and language tasks.
📝 Abstract
Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behavior of training runs improve when the momentum parameters satisfy $\beta_{1}=\beta_{2}$. Several recent studies have reported this pattern, but no explanation has been offered for why the choice helps. We show that it is closely tied to a structural property that we call \textit{gradient scale invariance}. We formalize this notion and prove that Adam is gradient scale invariant to first order if and only if $\beta_{1}=\beta_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_{1}=\beta_{2}$. Overall, our results offer a coherent explanation for an open question about Adam's behavior and a simple principle to guide the design of future optimizers.
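To make the first-order claim concrete, here is a minimal numerical sketch (not the paper's experiment; the function names and constants are illustrative): starting from the constant-gradient steady state $m=g$, $v=g^{2}$ of Adam without bias correction, we rescale the incoming gradient by a factor $c$ near 1 and measure how far the update direction $m/\sqrt{v}$ drifts from its steady-state value of 1, for balanced versus default betas.

```python
import math

def adam_step(b1, b2, m, v, g, eps=1e-8):
    """One Adam moment update (no bias correction, long-run regime);
    returns (new_m, new_v, update direction m / sqrt(v))."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return m, v, m / (math.sqrt(v) + eps)

def update_after_rescale(b1, b2, c, g=1.0):
    """From the constant-gradient steady state (m = g, v = g^2),
    feed one gradient rescaled by c and return the update direction."""
    return adam_step(b1, b2, g, g * g, c * g)[2]

c = 1.1  # a 10% rescaling of the incoming gradient
for b1, b2 in [(0.9, 0.9), (0.9, 0.999)]:
    u = update_after_rescale(b1, b2, c)
    print(f"beta1={b1}, beta2={b2}: update drifts by {abs(u - 1):.4f}")
```

In this toy setting the drift for $\beta_{1}=\beta_{2}=0.9$ is roughly an order of magnitude smaller than for the default $(0.9, 0.999)$, consistent with the first-order term $(1-\beta_{1})-(1-\beta_{2})$ vanishing only when the betas match.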