Decoupled DiLoCo for Resilient Distributed Pre-training

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large-scale language model pretraining under the SPMD paradigm to hardware failures, transient slowdowns, and synchronization overhead, which often lead to global training halts and wasted computational resources. To overcome these limitations, the authors propose Decoupled DiLoCo, a framework that decouples training into multiple asynchronous learners coordinated by a central synchronizer, thereby eliminating strict lockstep synchronization. The approach integrates a quorum-based minimum participation requirement, adaptive grace periods, and dynamic token-weighted fusion, combined with parameter-sharded communication and a chaos-engineering-inspired fault-tolerance strategy. Evaluated in simulated million-chip environments, Decoupled DiLoCo achieves strictly zero global downtime while preserving model performance across both dense and mixture-of-experts architectures, significantly improving training goodput under frequent failure conditions.

📝 Abstract

Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Decoupled DiLoCo partitions compute across multiple independent "learners" that execute local inner optimization steps. These learners asynchronously communicate parameter fragments to a central synchronizer, which circumvents failed or straggling learners by aggregating updates using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging. Inspired by "chaos engineering", we achieve significantly improved training efficiency in failure-prone environments with millions of simulated chips with strictly zero global downtime, while maintaining competitive model performance across text and vision tasks, for both dense and mixture-of-experts architectures.
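
The abstract names three aggregation ingredients (a minimum quorum, a grace window, and token-weighted merging) without giving their exact rules. The following is a minimal Python sketch of how one synchronization round for a single parameter fragment could combine them. It is an illustration under stated assumptions, not the authors' implementation: `LearnerUpdate`, `merge_round`, `min_quorum`, and `grace_s` are hypothetical names, the grace window is a fixed constant here rather than adaptive, and weighting each update by its share of processed tokens is one plausible reading of "token-weighted merging".

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class LearnerUpdate:
    """One learner's contribution to a sync round (hypothetical structure)."""
    learner_id: int
    delta: List[float]   # update for one parameter fragment, as a flat vector
    tokens: int          # tokens the learner processed since its last sync
    arrival_s: float     # seconds after the round opened when the update arrived


def merge_round(
    updates: Sequence[LearnerUpdate],
    min_quorum: int,
    grace_s: float,
) -> Optional[List[float]]:
    """Token-weighted mean of the updates accepted in this round.

    Returns None when fewer than `min_quorum` updates arrived, in which case
    the synchronizer would keep waiting instead of merging.
    Assumption: the grace window is a fixed `grace_s` seconds after the
    quorum is reached; the source describes it as adaptive.
    """
    ordered = sorted(updates, key=lambda u: u.arrival_s)
    if len(ordered) < min_quorum:
        return None

    # The quorum is met when the min_quorum-th update arrives; later updates
    # are still accepted until the grace window closes.
    deadline = ordered[min_quorum - 1].arrival_s + grace_s
    accepted = [u for u in ordered if u.arrival_s <= deadline]

    total_tokens = sum(u.tokens for u in accepted)
    if total_tokens == 0:
        return None

    merged = [0.0] * len(accepted[0].delta)
    for u in accepted:
        weight = u.tokens / total_tokens  # learners that did more work count more
        for i, value in enumerate(u.delta):
            merged[i] += weight * value
    return merged


updates = [
    LearnerUpdate(learner_id=0, delta=[1.0, 0.0], tokens=100, arrival_s=1.0),
    LearnerUpdate(learner_id=1, delta=[0.0, 1.0], tokens=300, arrival_s=2.0),
    # A straggler that misses the grace window and is skipped for this round.
    LearnerUpdate(learner_id=2, delta=[10.0, 10.0], tokens=100, arrival_s=50.0),
]
merged = merge_round(updates, min_quorum=2, grace_s=5.0)
print(merged)
```

In this sketch the round closes 5 seconds after the second update arrives, so the straggler never blocks the merge and the two punctual learners are combined in proportion to the tokens they processed. Raising `min_quorum` above the number of live learners makes `merge_round` return `None`, which corresponds to the synchronizer declining to merge rather than halting the learners.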

Problem

Research questions and friction points this paper is trying to address.

SPMD

synchronization overhead

hardware failures

straggler problem

distributed pre-training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled DiLoCo

asynchronous distributed training

fault tolerance