🤖 AI Summary
This work explains a long-observed empirical phenomenon: the Adam optimizer performs better when its first- and second-moment decay rates are equal (β₁ = β₂). Introducing the notion of gradient scale invariance, the study proves that Adam is first-order gradient scale invariant exactly when this condition holds, aligning its update rule with scale-robust design principles. Through a combination of optimization-theoretic analysis, formal modeling, and cross-architecture experiments, the research shows that setting β₁ = β₂ makes the update markedly less sensitive to gradient rescaling, improving training stability and validation performance. These benefits are validated consistently across diverse vision and language tasks.
📝 Abstract
Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behavior of training runs improve when the momentum parameters satisfy $\beta_{1}=\beta_{2}$. Several recent studies have reported this pattern, but no explanation has been offered for why the choice helps. We show that it is closely tied to a structural property that we call \textit{gradient scale invariance}. We formalize this notion and prove that Adam is gradient scale invariant to first order if and only if $\beta_{1}=\beta_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_{1}=\beta_{2}$. Overall, our results offer a coherent explanation for an open question about Adam's behavior and a simple principle to guide the design of future optimizers.
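To make the first-order claim concrete, here is a minimal numerical sketch (not the paper's experiment; the function names and constants are illustrative): starting from the constant-gradient steady state $m=g$, $v=g^{2}$ of Adam without bias correction, we rescale the incoming gradient by a factor $c$ near 1 and measure how far the update direction $m/\sqrt{v}$ drifts from its steady-state value of 1, for balanced versus default betas.

```python
import math

def adam_step(b1, b2, m, v, g, eps=1e-8):
    """One Adam moment update (no bias correction, long-run regime);
    returns (new_m, new_v, update direction m / sqrt(v))."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return m, v, m / (math.sqrt(v) + eps)

def update_after_rescale(b1, b2, c, g=1.0):
    """From the constant-gradient steady state (m = g, v = g^2),
    feed one gradient rescaled by c and return the update direction."""
    return adam_step(b1, b2, g, g * g, c * g)[2]

c = 1.1  # a 10% rescaling of the incoming gradient
for b1, b2 in [(0.9, 0.9), (0.9, 0.999)]:
    u = update_after_rescale(b1, b2, c)
    print(f"beta1={b1}, beta2={b2}: update drifts by {abs(u - 1):.4f}")
```

In this toy setting the drift for $\beta_{1}=\beta_{2}=0.9$ is roughly an order of magnitude smaller than for the default $(0.9, 0.999)$, consistent with the first-order term $(1-\beta_{1})-(1-\beta_{2})$ vanishing only when the betas match.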