🤖 AI Summary
Data parallelism in large language model (LLM) training suffers from scalability bottlenecks due to frequent inter-node synchronization. Method: This paper systematically studies DiLoCo (Distributed Low-Communication training), a distributed optimization strategy that decouples parameter updates across model replicas, under fixed compute budgets. Contribution/Results: It establishes quantitative scaling laws for DiLoCo, characterizing the coupled effects of replica count, hyperparameters, and token budget on training efficacy. Empirical results demonstrate that, when well-tuned, DiLoCo outperforms standard data parallelism even at modest model sizes, achieving larger optimal batch sizes, lower evaluation loss for a fixed token budget, and improved downstream generalization. Crucially, its performance scales predictably and robustly with model size. This work provides both an empirically grounded, scalable training paradigm and principled design guidelines for efficient distributed LLM training.
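The decoupled update scheme summarized above follows DiLoCo's two-level structure: each replica takes many local optimizer steps without communicating, and replicas synchronize only occasionally via an outer step applied to the averaged parameter delta. A minimal toy sketch of that structure is below; everything in it (a scalar least-squares model, plain SGD inner steps in place of the paper's inner optimizer, and all constants) is illustrative, not the paper's actual setup.

```python
# Toy sketch of DiLoCo-style inner/outer optimization (illustrative only:
# scalar least-squares model, SGD inner steps, Nesterov-momentum outer steps).
import numpy as np

rng = np.random.default_rng(0)

def inner_sgd(theta, shard, steps=20, lr=0.05):
    """One replica's inner phase: local SGD steps, no communication."""
    for _ in range(steps):
        x, y = shard[rng.integers(len(shard))]
        grad = 2 * x * (theta * x - y)  # d/dtheta of (theta*x - y)^2
        theta -= lr * grad
    return theta

# Toy data: y = 3x + noise, sharded across M replicas.
M = 4
data = [(x, 3.0 * x + rng.normal(0, 0.1)) for x in rng.uniform(-1, 1, 400)]
shards = [data[i::M] for i in range(M)]

theta_global = 0.0
velocity = 0.0
outer_lr, outer_momentum = 0.7, 0.9

for _ in range(30):
    # Inner phase: replicas train independently from the shared parameters.
    local_thetas = [inner_sgd(theta_global, shard) for shard in shards]
    # Outer gradient: averaged parameter delta; the only synchronization point.
    outer_grad = theta_global - np.mean(local_thetas)
    # Nesterov-momentum outer update on the global parameters.
    velocity = outer_momentum * velocity + outer_grad
    theta_global -= outer_lr * (outer_grad + outer_momentum * velocity)

print(f"learned slope ~ {theta_global:.2f}")  # should approach the true slope 3
```

The key communication saving is visible in the loop structure: synchronization happens once per outer step rather than once per gradient step, so increasing the number of inner steps directly reduces communication frequency.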
📝 Abstract
As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including the number of model replicas, hyperparameters, and token budget, affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.