🤖 AI Summary
To address optimization-efficiency bottlenecks in large language model (LLM) training, this paper proposes SNOO, a novel optimizer built on a Lookahead-inspired two-loop architecture. Methodologically, SNOO applies Nesterov momentum to pseudo-gradients (the net displacement produced by multiple inner-optimizer steps) rather than to true gradients, wrapping an inner optimizer such as AdamW or Muon to update the slow weights. The design is fully compatible with model sharding, incurs negligible compute and memory overhead, and its gains grow with model size. Empirically, at training scales up to 1e23 FLOPs, SNOO achieves compute-factor gains of 1.5–2.5× over AdamW, measured as the FLOPs needed to reach a given loss, demonstrating substantial acceleration for large-model training in non-distributed settings.
📝 Abstract
The rapid development of large language models (LLMs) has driven demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner-optimizer steps on the fast weights produce a trajectory, the pseudo-gradient, that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, and is reported to outperform AdamW even in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute-factor gains of 1.5–2.5$\times$ in a non-distributed setting at scales up to 1e23 training FLOPs, with improvements that grow with model size. Because of its minimal compute and memory overhead and its compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
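The two-loop structure described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function and parameter names (`snoo_train`, `inner_step`, `outer_lr`, `mu`) and the hyperparameter values are assumptions, and the inner optimizer is abstracted as a single-step callable.

```python
import numpy as np

def snoo_train(w0, inner_step, num_outer, K, outer_lr=0.7, mu=0.9):
    """Sketch of SNOO's two-loop structure (illustrative, not the paper's code).

    w0         : initial slow weights (np.ndarray)
    inner_step : callable w -> w_new, one step of the inner optimizer
                 (e.g., a step of AdamW or Muon in the actual method)
    num_outer  : number of outer-loop iterations
    K          : inner-optimizer steps per outer step
    outer_lr   : outer learning rate
    mu         : Nesterov momentum coefficient for the outer update
    """
    slow = w0.astype(float).copy()
    m = np.zeros_like(slow)              # outer momentum buffer
    for _ in range(num_outer):
        fast = slow.copy()               # start fast weights from slow weights
        for _ in range(K):               # inner loop on the fast weights
            fast = inner_step(fast)
        pseudo_grad = slow - fast        # pseudo-gradient: net inner displacement
        m = mu * m + pseudo_grad         # accumulate outer momentum
        # Nesterov-style outer update: look ahead along the momentum direction
        slow = slow - outer_lr * (mu * m + pseudo_grad)
    return slow

# Toy usage: inner optimizer is plain gradient descent on f(w) = w^2,
# so each inner step shrinks w toward the minimum at 0.
gd = lambda w: w - 0.1 * (2.0 * w)
w_final = snoo_train(np.array([1.0]), gd, num_outer=30, K=5,
                     outer_lr=0.5, mu=0.5)
```

With $\mu = 0$ and `outer_lr` $= 1$, this reduces to vanilla Lookahead (slow weights simply adopt the fast weights); the Nesterov outer step on the pseudo-gradient is the ingredient the paper identifies as the source of DiLoCo's non-distributed gains.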