What happens when nanochat meets DiLoCo?

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates a fundamental trade-off in communication-constrained distributed large language model (LLM) training: the irreversible representation drift induced by asynchronous local updates, exemplified by DiLoCo, and its detrimental impact on downstream performance. Using nanochat, the authors build a lightweight, reproducible framework to systematically compare DiLoCo against standard distributed data-parallel (DDP) training across pretraining and downstream evaluation (MMLU, GSM8K, HumanEval). Results show that while DiLoCo achieves stable convergence and reduced communication overhead, it incurs significant and irreversible representation degradation, leading to persistent underperformance relative to DDP, even after subsequent synchronized fine-tuning. Crucially, this degradation cannot be recovered post hoc. To the authors' knowledge, this is the first empirical demonstration of the irreversibility of representation drift in low-communication training paradigms. The findings provide empirical insight and practical guidance for balancing communication efficiency against representational fidelity in distributed LLM training.

📝 Abstract
Although LLM training is typically centralized with high-bandwidth interconnects and large compute budgets, emerging methods target communication-constrained training in distributed environments. The model trade-offs introduced by this shift remain underexplored, and our goal is to study them. We use the open-source nanochat project, a compact 8K-line full-stack ChatGPT-like implementation containing tokenization, pretraining, fine-tuning, and serving, as a controlled baseline. We implement the DiLoCo algorithm as a lightweight wrapper over nanochat's training loop, performing multiple local steps per worker before synchronization with an outer optimizer, effectively reducing communication by orders of magnitude. This inner-outer training is compared against a standard data-parallel (DDP) setup. Because nanochat is small and inspectable, it enables controlled pipeline adaptations and allows direct comparison with the conventional centralized baseline. DiLoCo achieves stable convergence and competitive loss in pretraining but yields worse MMLU, GSM8K, and HumanEval scores after mid-training and SFT. We discover that using DiLoCo-pretrained weights and running mid- and post-training with DDP fails to recover performance, revealing irreversible representation drift from asynchronous updates that impairs downstream alignment. We provide this implementation as an official fork of nanochat on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Explores distributed LLM training with limited communication bandwidth
Compares DiLoCo algorithm against standard data-parallel training methods
Investigates irreversible representation drift from asynchronous training updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiLoCo algorithm reduces communication during distributed training
Inner-outer optimization performs multiple local steps before synchronization
Lightweight wrapper enables controlled comparison with data-parallel baseline
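The inner-outer scheme described above (each worker takes many local optimizer steps, then workers synchronize once through an outer optimizer applied to a pseudo-gradient) can be sketched in plain NumPy. This is an illustrative simplification, not the fork's actual code: the function names are made up, the inner optimizer here is plain SGD and the outer one heavy-ball momentum SGD, whereas DiLoCo as published uses AdamW inner and Nesterov-momentum outer optimizers.

```python
import numpy as np

def inner_steps(params, grad_fn, lr, steps):
    """Run several local (inner) SGD steps on one worker's copy of the weights."""
    p = params.copy()
    for _ in range(steps):
        p -= lr * grad_fn(p)
    return p

def diloco_round(global_params, worker_grad_fns, inner_lr, n_inner,
                 outer_lr, outer_momentum, velocity):
    """One outer round: local training on each worker, then a single sync.

    Communication happens only here, once per round, rather than at every
    step as in DDP -- this is where the orders-of-magnitude saving comes from.
    """
    # Every worker starts the round from the same global parameters.
    local = [inner_steps(global_params, g, inner_lr, n_inner)
             for g in worker_grad_fns]
    # Pseudo-gradient: displacement of the averaged local weights from global.
    pseudo_grad = global_params - np.mean(local, axis=0)
    # Outer optimizer: momentum SGD on the pseudo-gradient (illustrative;
    # DiLoCo uses a Nesterov-momentum outer optimizer).
    velocity = outer_momentum * velocity + pseudo_grad
    return global_params - outer_lr * velocity, velocity
```

On a toy problem with two workers holding different quadratic losses, repeated `diloco_round` calls drive the global parameters toward the minimizer of the averaged objective, which is the behavior the paper's wrapper reproduces around nanochat's training loop at scale.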