🤖 AI Summary
Cross-region large language model (LLM) training is severely hindered by high wide-area network (WAN) communication latency. Conventional synchronous methods are inefficient, while asynchronous approaches such as Streaming DiLoCo overlap computation and communication but degrade convergence and introduce model inconsistency due to stale updates. This paper proposes CoCoDC, the first distributed training framework to jointly enable computation-communication overlap and staleness-aware gradient compensation. Specifically, it (1) introduces a first-order Taylor-based delayed-gradient compensation mechanism that explicitly corrects outdated parameter updates, and (2) designs an adaptive model-fragment transmission and synchronization scheduling strategy that maximizes bandwidth utilization without compromising model consistency. Experiments show that, at comparable perplexity, CoCoDC reduces the required training steps by up to 21.0% relative to Streaming DiLoCo, significantly accelerating convergence and improving final model accuracy.
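For intuition, here is a minimal sketch of what first-order Taylor delay compensation can look like. The paper's exact update rule is not reproduced here; the snippet assumes the common diagonal-Hessian surrogate from delay-compensated SGD, where the Hessian term in g(w_t) ≈ g(w_s) + H(w_s)(w_t − w_s) is approximated by λ · g ⊙ g. The function name and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def compensate_stale_gradient(g_stale, w_current, w_stale, lam=0.5):
    """Correct a stale gradient via a first-order Taylor expansion.

    Approximates g(w_current) ~= g(w_stale) + H(w_stale) @ (w_current - w_stale),
    replacing the Hessian with the cheap diagonal surrogate lam * g * g
    (an assumption; CoCoDC's exact compensation rule may differ).
    """
    return g_stale + lam * g_stale * g_stale * (w_current - w_stale)

# Toy usage: a worker computed g_stale at w_stale, but the global model
# has since advanced to w_current while communication was overlapped.
rng = np.random.default_rng(0)
w_stale = rng.standard_normal(4)
w_current = w_stale + 0.01 * rng.standard_normal(4)
g_stale = rng.standard_normal(4)
print(compensate_stale_gradient(g_stale, w_current, w_stale))
```

The correction term vanishes as w_current approaches w_stale, so the rule reduces to ordinary gradient descent when there is no staleness.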
📝 Abstract
Training large language models (LLMs) requires massive computational resources, often necessitating the aggregation of geographically distributed data centers (i.e., cross-region training). However, the high communication latency of wide-area networks severely degrades the efficiency of traditional distributed training. While methods like DiLoCo reduce communication frequency, they suffer from blocking synchronization. Streaming DiLoCo alleviates this issue via communication-computation overlapping but introduces update staleness and model inconsistency due to delayed global updates and partial synchronization. These factors impair convergence, especially when aggressive overlap is needed to mask high latency. We propose CoCoDC, a novel distributed training framework with communication-computation overlapping and delay compensation, to explicitly tackle these challenges. Within CoCoDC, we develop a Delay Compensation strategy based on Taylor expansion that effectively mitigates update staleness, and an Adaptive Transmission strategy that dynamically schedules model-fragment synchronization to optimize bandwidth usage and accelerate convergence. Extensive experiments show that CoCoDC outperforms both DiLoCo and Streaming DiLoCo in final accuracy and training speed; in particular, CoCoDC reduces the training steps needed to reach a comparable perplexity by up to 21.0% compared to Streaming DiLoCo. Our work provides an effective solution for scalable and efficient cross-region LLM training.
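The abstract does not detail the Adaptive Transmission policy, so the following is only a hypothetical sketch of one staleness-aware scheduler consistent with its description: each round, the fragments that have gone longest without synchronization are sent first, subject to a bandwidth budget. All names (`schedule_fragments`, `staleness`, `sizes`) and the greedy policy itself are assumptions for illustration, not CoCoDC's actual algorithm.

```python
import numpy as np

def schedule_fragments(staleness, sizes, bandwidth_budget):
    """Pick which model fragments to synchronize this round (sketch).

    Greedy heuristic: prefer the most stale fragments, subject to a
    per-round bandwidth budget. staleness[i] counts rounds since
    fragment i was last synchronized; sizes[i] is its payload size.
    """
    order = np.argsort(-np.asarray(staleness))  # most stale first
    chosen, used = [], 0
    for i in order:
        if used + sizes[i] <= bandwidth_budget:
            chosen.append(int(i))
            used += sizes[i]
    return chosen

# Toy usage: four fragments, 70 MB of WAN budget this round.
staleness = [3, 0, 5, 1]   # rounds since each fragment last synced
sizes = [40, 25, 30, 25]   # MB per fragment
print(schedule_fragments(staleness, sizes, bandwidth_budget=70))  # -> [2, 0]
```

A staleness-first order is a natural fit for masking WAN latency: it bounds how stale any single fragment can become while keeping each round's transfer within the available bandwidth.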