🤖 AI Summary
To address optimization-efficiency bottlenecks in large language model (LLM) training, this paper proposes SNOO, a novel optimizer built on a Lookahead-inspired two-loop architecture. Methodologically, SNOO applies Nesterov momentum to pseudo-gradients (the net displacement produced by multiple inner-optimizer steps) rather than to true gradients, wrapping an inner optimizer such as AdamW or Muon to update the slow weights. The design is fully compatible with model sharding, incurs negligible compute and memory overhead, and its gains grow with model size. Empirically, at training scales up to 1e23 FLOPs, SNOO achieves compute-factor gains of 1.5–2.5× over AdamW, measured as the FLOPs needed to reach a given loss, demonstrating substantial acceleration for large-model training in non-distributed settings.
📝 Abstract
The rapid development of large language models (LLMs) has driven demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner-optimizer steps on the fast weights produce a trajectory, the pseudo-gradient, that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, and is reported to outperform AdamW even in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute-factor gains of 1.5–2.5$\times$ in a non-distributed setting at scales up to 1e23 training FLOPs, with improvements that grow with model size. Because of its minimal compute and memory overhead and its compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
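The two-loop structure described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function and parameter names (`snoo_train`, `inner_step`, `outer_lr`, `mu`) and the hyperparameter values are assumptions, and the inner optimizer is abstracted as a single-step callable.

```python
import numpy as np

def snoo_train(w0, inner_step, num_outer, K, outer_lr=0.7, mu=0.9):
    """Sketch of SNOO's two-loop structure (illustrative, not the paper's code).

    w0         : initial slow weights (np.ndarray)
    inner_step : callable w -> w_new, one step of the inner optimizer
                 (e.g., a step of AdamW or Muon in the actual method)
    num_outer  : number of outer-loop iterations
    K          : inner-optimizer steps per outer step
    outer_lr   : outer learning rate
    mu         : Nesterov momentum coefficient for the outer update
    """
    slow = w0.astype(float).copy()
    m = np.zeros_like(slow)              # outer momentum buffer
    for _ in range(num_outer):
        fast = slow.copy()               # start fast weights from slow weights
        for _ in range(K):               # inner loop on the fast weights
            fast = inner_step(fast)
        pseudo_grad = slow - fast        # pseudo-gradient: net inner displacement
        m = mu * m + pseudo_grad         # accumulate outer momentum
        # Nesterov-style outer update: look ahead along the momentum direction
        slow = slow - outer_lr * (mu * m + pseudo_grad)
    return slow

# Toy usage: inner optimizer is plain gradient descent on f(w) = w^2,
# so each inner step shrinks w toward the minimum at 0.
gd = lambda w: w - 0.1 * (2.0 * w)
w_final = snoo_train(np.array([1.0]), gd, num_outer=30, K=5,
                     outer_lr=0.5, mu=0.5)
```

With $\mu = 0$ and `outer_lr` $= 1$, this reduces to vanilla Lookahead (slow weights simply adopt the fast weights); the Nesterov outer step on the pseudo-gradient is the ingredient the paper identifies as the source of DiLoCo's non-distributed gains.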