Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work explains a long-observed empirical phenomenon: the Adam optimizer performs better when its first- and second-moment decay rates are equal (β₁ = β₂). By introducing the principle of gradient scale invariance, the study proves that Adam possesses first-order gradient scale invariance if and only if β₁ = β₂, aligning its update rule with the scale-robust design principles behind several recent optimizers. Combining optimization-theoretic analysis with cross-architecture experiments, the research demonstrates that setting β₁ = β₂ makes the update respond more smoothly to gradient rescaling, improving training stability and validation performance. These benefits are validated consistently across diverse vision and language tasks.
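To make the first-order claim concrete, here is a minimal steady-state sketch (our illustration under simplifying assumptions, not the paper's proof): suppose the moment estimates have equilibrated on a constant gradient $g$, so that $m = g$ and $v = g^{2}$ with bias correction and $\varepsilon$ ignored, and a single incoming gradient is then rescaled to $(1+\delta)g$.

$$
\begin{aligned}
m' &= \beta_{1} g + (1-\beta_{1})(1+\delta)g = g\bigl(1 + \delta(1-\beta_{1})\bigr),\\
v' &= \beta_{2} g^{2} + (1-\beta_{2})(1+\delta)^{2} g^{2} = g^{2}\bigl(1 + 2\delta(1-\beta_{2})\bigr) + O(\delta^{2}),\\
\frac{m'}{\sqrt{v'}} &= \operatorname{sign}(g)\bigl(1 + \delta(\beta_{2}-\beta_{1})\bigr) + O(\delta^{2}).
\end{aligned}
$$

The first-order response $\delta(\beta_{2}-\beta_{1})$ vanishes exactly when $\beta_{1}=\beta_{2}$, leaving only an $O(\delta^{2})$ effect; away from this idealized steady state the paper's full argument is needed.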

📝 Abstract
Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behavior of training runs improve when the momentum parameters satisfy $\beta_{1}=\beta_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as *gradient scale invariance*. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $\beta_{1}=\beta_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_{1}=\beta_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
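The steady-state sketch above can be checked numerically in a few lines. The snippet below is an illustrative check under the same simplifying assumptions (constant-gradient steady state $m = g$, $v = g^{2}$, no bias correction, $\varepsilon = 0$); it is not the paper's experimental protocol, and the helper name is our own.

```python
import math

def relative_update_change(beta1, beta2, delta=1e-2, g=1.0):
    """Measure how much Adam's update m / sqrt(v) moves when one gradient
    is rescaled by (1 + delta), starting from the constant-gradient
    steady state m = g, v = g**2 (bias correction and eps omitted)."""
    m, v = g, g * g                          # steady-state moment estimates
    gp = (1.0 + delta) * g                   # single rescaled gradient
    m = beta1 * m + (1.0 - beta1) * gp       # first-moment EMA step
    v = beta2 * v + (1.0 - beta2) * gp * gp  # second-moment EMA step
    return m / math.sqrt(v) - 1.0            # steady-state update is exactly 1

for b1, b2 in [(0.9, 0.999), (0.9, 0.99), (0.95, 0.95)]:
    observed = relative_update_change(b1, b2)
    predicted = 1e-2 * (b2 - b1)             # first-order term delta*(b2 - b1)
    print(f"beta1={b1:<5} beta2={b2:<6} observed={observed:+.2e} "
          f"first-order={predicted:+.2e}")
```

With $\delta = 10^{-2}$, the observed change tracks the prediction $\delta(\beta_{2}-\beta_{1})$ closely when the betas differ, and collapses to second order (about $10^{-6}$ here) when $\beta_{1}=\beta_{2}$.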
Problem

Research questions and friction points this paper is trying to address.

Adam optimizer
gradient scale invariance
momentum parameters
optimization
deep learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient scale invariance
Adam optimizer
β₁ = β₂
scale-robust optimization
first-order invariance
👥 Authors
Alberto Fernández-Hernández
Universitat Politècnica de València, Valencia, Spain
Cristian Pérez-Corral
Universitat Politècnica de València, Valencia, Spain
José I. Mestre
Universitat Jaume I, Castelló de la Plana, Spain
M. F. Dolz
Universitat Jaume I, Castelló de la Plana, Spain
Enrique S. Quintana-Ortí
Universitat Politècnica de València, Spain