Scaling the Memory of Balanced Adam

📅 2026-05-11

🤖 AI Summary

This work addresses the value of the single remaining momentum parameter β in balanced Adam (β₁ = β₂), which is usually treated as a dimensionless constant and remains poorly understood. The authors reinterpret β as setting a statistical memory horizon H_β = (1 − β)⁻¹ and introduce the refresh count R_β = (1 − β)T_ES, where T_ES is the effective learning horizon estimated from the validation trajectory. They recommend choosing β so that R_β ≈ 1000, which yields different β values at different training scales. Across 11 vision and language experiments, this rule improves worst-case robustness: compared with the strongest fixed choice, β = 0.94377, it reduces the global maximum validation gap by 33.4% and brings all 11 runs within 1% of their validation oracle.

📝 Abstract

Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $β_1=β_2$, reducing the optimizer to a single remaining parameter. However, the value of this parameter is still poorly understood. We argue that, in balanced Adam, $β$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_β=(1-β)^{-1}$. In terms of the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_β=(1-β)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $β$ so that $R_β\approx1000$ selects different $β$ values depending on the training scale, yet improves robustness over the best fixed-$β$ baseline. Compared with the strongest fixed choice $β=0.94377$, the refresh rule improves worst-case robustness, reducing the global maximum validation gap by $33.4\%$, while bringing all 11 runs within $1\%$ of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is better understood as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
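The arithmetic behind the refresh rule can be made concrete with a short sketch. It only restates the two definitions from the abstract, $H_β=(1-β)^{-1}$ and $R_β=(1-β)T_{\mathrm{ES}}$, and solves the second for $β$. The function names, the example value of `t_es`, and the error raised when `t_es` is smaller than the target are illustrative assumptions, not the authors' code; in particular, how $T_{\mathrm{ES}}$ is estimated from the validation trajectory is not shown here and the value is simply passed in.

```python
def memory_horizon(beta: float) -> float:
    """Memory horizon H_beta = 1 / (1 - beta), in optimizer steps."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return 1.0 / (1.0 - beta)


def refresh_count(beta: float, t_es: float) -> float:
    """Refresh count R_beta = (1 - beta) * T_ES."""
    return (1.0 - beta) * t_es


def beta_for_refresh(t_es: float, target_refresh: float = 1000.0) -> float:
    """Solve R_beta = target_refresh for beta, given T_ES.

    Assumption: T_ES is supplied by the caller. The case
    t_es < target_refresh would give a negative beta, so this
    sketch rejects it; the source does not say how to handle it.
    """
    if t_es < target_refresh:
        raise ValueError("t_es must be at least target_refresh")
    return 1.0 - target_refresh / t_es


if __name__ == "__main__":
    # Hypothetical effective learning horizons, in steps.
    for t_es in (20_000, 100_000, 1_000_000):
        beta = beta_for_refresh(t_es)
        print(t_es, beta, memory_horizon(beta), refresh_count(beta, t_es))
```

Under these assumptions a longer effective horizon maps to a larger $β$ and a proportionally longer memory, while the refresh count stays fixed at the target. In a framework such as PyTorch the resulting value would be passed for both momentum coefficients, for example `betas=(beta, beta)`, to match the balanced setting.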

Problem

Research questions and friction points this paper is trying to address.

Balanced Adam

memory horizon

hyperparameter scaling

optimizer robustness

momentum parameter

Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced Adam

memory horizon

refresh count

optimizer scaling

adaptive beta
