🤖 AI Summary
Schedule-Free optimizers, AdEMAMix, and noise-dominated accelerated SGD variants exhibit disparate formulations, yet share an underlying structural principle: the decoupling of the momentum coefficient from the current gradient's weight.
Method: We propose the first unified theoretical framework encompassing these diverse state-of-the-art optimizers and introduce Simplified-AdEMAMix, a streamlined variant that eliminates the dual-momentum mechanism while preserving full-batch convergence rates and drastically reducing implementation complexity.
Contribution/Results: Our theoretical analysis establishes that such decoupling yields fundamental acceleration under high-gradient noise. Empirical evaluation on a 150M-parameter language model demonstrates that Simplified-AdEMAMix matches AdEMAMix’s performance across both small- and large-batch regimes. The open-sourced implementation validates the practical efficacy and deployability of noise-driven acceleration mechanisms in real-world training.
📝 Abstract
Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150M-parameter language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available at: https://github.com/DepenM/Simplified-AdEMAMix/.
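To make the decoupling principle concrete, the sketch below shows a generic momentum-SGD update in which the momentum coefficient `beta` and the current gradient's weight `gamma` are independent hyperparameters, unlike classical heavy-ball momentum where the two are tied (e.g., the gradient weight is fixed at `1 - beta`). This is an illustrative toy on a deterministic quadratic, with all hyperparameter values chosen for the example; it is not the exact AdEMAMix or Simplified-AdEMAMix update rule.

```python
import numpy as np

def decoupled_momentum_sgd(grad_fn, theta, lr=0.02, beta=0.9, gamma=1.0, steps=200):
    """SGD with momentum where the current gradient's weight (gamma) is
    decoupled from the momentum coefficient (beta).

    Illustrative sketch only: the actual optimizers discussed in the paper
    (AdEMAMix, Simplified-AdEMAMix) add further mechanisms on top of this idea.
    """
    m = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        m = beta * m + g                      # momentum accumulates raw gradients
        theta = theta - lr * (m + gamma * g)  # extra, separately weighted gradient term
    return theta

# Usage: minimize a simple ill-conditioned quadratic f(x) = 0.5 * x^T A x
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
x_star = decoupled_momentum_sgd(grad, np.array([5.0, 5.0]))
```

Setting `gamma = 1 - beta` recovers a standard exponential-moving-average momentum update, so the decoupled form strictly generalizes it.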