Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing averaging-based optimizers, such as DiLoCo (a dual-loop method with high memory and hyperparameter overhead) and Schedule-Free methods, exhibit limitations in non-distributed settings, including structural rigidity, excessive memory consumption, and hyperparameter complexity. Method: We propose Generalized Primal Averaging (GPA), the first optimizer to decouple the interpolation constant in Nesterov's primal averaging, enabling smooth, stepwise iterate averaging without periodic dual-loop scheduling. GPA requires only a single cached buffer, drastically reducing memory footprint and hyperparameter burden, while remaining plug-and-play compatible with base optimizers (e.g., AdamW). Contribution/Results: We prove GPA retains the standard $O(\sqrt{T})$ regret bound without convergence degradation. Empirically, on Llama-160M, GPA achieves a 24.22% speedup over AdamW at equivalent validation loss; on ImageNet with ViT, it accelerates training by 12% (small batch) and 27% (large batch), demonstrating broad applicability and efficiency gains.

📝 Abstract
We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
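The core mechanism described in the abstract, a base optimizer updating the fast weights while a single extra buffer smoothly interpolates an averaged iterate at every step, can be sketched as follows. This is a minimal illustration under assumptions: the paper does not give its exact update rule here, so the constant interpolation coefficient `c`, the plain gradient-descent base step, and the toy 1-D quadratic objective are all placeholders, not the authors' algorithm.

```python
# Sketch of stepwise iterate averaging in the primal-averaging spirit of GPA.
# Illustrative only: constants, schedule, and base optimizer are assumptions.
# Base optimizer: plain gradient descent on f(w) = (w - 3)^2.
# A single cached buffer `z` holds the smoothed iterate, updated every step
# as z <- (1 - c) * z + c * w, with no periodic two-loop aggregation.

def gpa_sketch(steps=500, lr=0.05, c=0.1, w0=0.0):
    grad = lambda w: 2.0 * (w - 3.0)   # gradient of (w - 3)^2
    w = w0                             # fast iterate, moved by the base optimizer
    z = w0                             # single additional buffer: averaged iterate
    for _ in range(steps):
        w -= lr * grad(w)              # base-optimizer step
        z = (1.0 - c) * z + c * w      # smooth stepwise interpolation
    return w, z

w, z = gpa_sketch()
print(round(w, 4), round(z, 4))  # both iterates approach the minimizer 3.0
```

Note the contrast with single-worker DiLoCo's structure: there is no inner loop that runs for a fixed number of steps before an averaging event; the interpolation happens once per step, which is what removes the extra scheduling hyperparameters.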
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of averaging-based optimizers like DiLoCo and Schedule-Free
Reduces memory overhead and hyperparameters in large language model training
Improves convergence speed and performance over AdamW on benchmark tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPA decouples interpolation constant for smooth averaging
Removes two-loop structure to simplify hyperparameter tuning
Reduces memory overhead to single additional buffer