🤖 AI Summary
This work addresses the limitation of fixed hyperparameters in large-scale language model training, which hinders optimizers from adapting to the power-law structure of data. The authors propose ADANA, a novel optimizer that uniquely integrates logarithmic-time scheduling with an explicit damping mechanism to dynamically adjust AdamW’s hyperparameters (β₁, β₂, λ). This approach extends the gradient memory window and enhances optimization stability, breaking away from the conventional fixed-hyperparameter paradigm. Evaluated across models ranging from 45M to 2.6B parameters, ADANA achieves up to a 40% improvement in computational efficiency over carefully tuned AdamW, with performance gains amplifying as model scale increases. Notably, the logarithmic-time weight decay component alone already yields significant benefits.
📝 Abstract
In practice, the hyperparameters $(\beta_1, \beta_2)$ and the weight-decay coefficient $\lambda$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for $(\beta_1, \beta_2, \lambda)$ that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer's gradient memory horizon grows with training time. Although naive variants of this schedule are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Based on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We empirically evaluate ADANA across transformer scales (45M to 2.6B parameters), comparing against AdamW, Muon, and AdEMAMix. When properly tuned, ADANA achieves up to a 40% compute-efficiency gain relative to a tuned AdamW, with gains that persist, and even improve, as model scale increases. We further show that similar benefits arise when applying logarithmic-time scheduling to AdEMAMix, and that logarithmic-time weight decay alone can yield significant improvements. Finally, we present variants of ADANA that mitigate potential failure modes and improve robustness.
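To make the idea of a growing gradient memory horizon concrete, here is a minimal illustrative sketch (not the paper's actual ADANA schedule). It assumes the standard relation that an EMA with coefficient $\beta$ has an effective horizon of roughly $1/(1-\beta)$ steps, and schedules $\beta(t)$ so that this horizon grows linearly with the step count; the function name, the `growth` parameter, and the cap `beta_max` are hypothetical choices for illustration only.

```python
def log_time_beta(step, base_beta=0.9, growth=1.0, beta_max=0.9999):
    """Illustrative time-varying EMA coefficient.

    The effective memory horizon of an EMA is roughly 1 / (1 - beta).
    Here we let that horizon grow linearly with the step count:
        beta(t) = 1 - (1 - base_beta) / (1 + growth * t),
    capped at beta_max so the horizon does not grow without bound.
    This is a sketch of the *concept* of log-time scheduling, not the
    schedule used by ADANA itself.
    """
    beta = 1.0 - (1.0 - base_beta) / (1.0 + growth * step)
    return min(beta, beta_max)


# The horizon 1/(1 - beta) increases monotonically with training time,
# so late gradients are averaged over ever-longer windows.
horizons = [1.0 / (1.0 - log_time_beta(t)) for t in (0, 9, 99)]
```

For example, with the defaults above the horizon starts at 10 steps ($\beta = 0.9$) and reaches 100 steps by step 9 ($\beta = 0.99$). The abstract notes that such growing-memory schedules are unstable when applied naively, which is what ADANA's explicit damping mechanism addresses.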