🤖 AI Summary
This work challenges the common practice of treating Adam's momentum parameter β as a fixed, dimensionless constant with little theoretical grounding. For balanced Adam (β₁ = β₂ = β), the authors reinterpret β as setting a statistical memory horizon H_β = (1 − β)⁻¹ and introduce the refresh count R_β = (1 − β)T_ES, where T_ES is the effective learning horizon estimated from the validation trajectory. R_β measures how many times Adam renews its internal statistics during the useful phase of training. They recommend choosing β so that R_β ≈ 1000, which yields different β values at different training scales. Across 11 vision and language experiments, this rule improves worst-case robustness: compared with the strongest fixed choice, β = 0.94377, it reduces the global maximum validation gap by 33.4% and brings all 11 runs within 1% of their validation oracle.
📝 Abstract
Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $β_1=β_2$, reducing the optimizer to a single remaining parameter. However, the value of this parameter is still poorly understood. We argue that, in balanced Adam, $β$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_β=(1-β)^{-1}$. Given the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_β=(1-β)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $β$ so that $R_β\approx1000$ selects different $β$ values depending on the training scale, yet improves robustness over the best fixed-$β$ baseline. Compared with the strongest fixed choice $β=0.94377$, the refresh rule improves worst-case robustness, reducing the global maximum validation gap by $33.4\%$, while bringing all 11 runs within $1\%$ of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is better understood as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
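The refresh rule can be made concrete by solving $R_β=(1-β)T_{\mathrm{ES}}$ for $β$, which gives $β = 1 - R_β/T_{\mathrm{ES}}$ and $H_β = T_{\mathrm{ES}}/R_β$. The following is a minimal sketch of that arithmetic only, not the authors' implementation. The function names are hypothetical, and $T_{\mathrm{ES}}$ is assumed to be already available as a step count; the abstract does not specify how it is estimated from the validation trajectory. The example horizons in the loop are illustrative values, not settings from the paper.

```python
def beta_from_refresh_count(t_es: float, refresh_target: float = 1000.0) -> float:
    """Solve R = (1 - beta) * T_ES for beta, given a target refresh count R.

    t_es is the effective learning horizon in optimizer steps (assumed known).
    """
    if t_es <= 0:
        raise ValueError("t_es must be positive")
    if not 0 < refresh_target < t_es:
        # beta must lie in (0, 1), which requires 0 < R < T_ES.
        raise ValueError("refresh_target must lie strictly between 0 and t_es")
    return 1.0 - refresh_target / t_es


def memory_horizon(beta: float) -> float:
    """Statistical memory horizon H_beta = 1 / (1 - beta)."""
    if not 0 <= beta < 1:
        raise ValueError("beta must lie in [0, 1)")
    return 1.0 / (1.0 - beta)


def refresh_count(beta: float, t_es: float) -> float:
    """Refresh count R_beta = (1 - beta) * T_ES."""
    return (1.0 - beta) * t_es


# Illustrative horizons (hypothetical values): longer runs get a larger beta.
for t_es in (10_000, 100_000, 1_000_000):
    beta = beta_from_refresh_count(t_es)
    print(f"T_ES={t_es:>9,d}  beta={beta:.5f}  H_beta={memory_horizon(beta):,.0f}")
```

Under these definitions, the fixed baseline $β=0.94377$ matches $R_β\approx1000$ only when $T_{\mathrm{ES}}$ is roughly 17,800 steps, which illustrates why a single fixed value cannot satisfy the rule across training scales. In a framework such as PyTorch, the resulting value would be passed to both momentum slots, for example `torch.optim.Adam(params, betas=(beta, beta))`, to match the balanced setting $β_1=β_2$.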