DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the communication bottleneck of distributed LLM training in low-bandwidth, geo-distributed settings, and the fundamental limitation that Local SGD's synchronization step cannot be overlapped with computation, this paper proposes a layer-granularity decoupled local synchronization mechanism. The method introduces (1) a partial synchronization strategy that synchronizes individual layers on demand rather than the whole model at once; (2) a theoretical convergence analysis showing rates comparable to synchronous SGD; and (3) dynamic scheduling based on fine-grained profiling of per-layer communication and computation times, hiding communication behind computation with no additional GPU memory overhead. Evaluated on ResNet, GPT-2, and Llama-2 training across 32 GPUs, the approach improves the convergence of Local SGD (and Adam) and achieves end-to-end speedups of 1.49×–3.91× over leading baselines.

📝 Abstract
The growth of large language models (LLMs) increases the challenge of accelerating distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in geo-distributed data centers. In geo-distributed data parallel (DDP) training with synchronous stochastic gradient descent (S-SGD), communication is the main bottleneck in low-bandwidth environments. Local SGD mitigates communication overhead by reducing synchronization frequency, and recent studies have successfully applied it to pre-train LLMs across geo-distributed sites. However, we identify that its model synchronization mechanism prevents overlapping communication with computation. To overcome this limitation, we expand the design space of Local SGD by decoupling model synchronization layer-wise: in each iteration, only a subset of layers is synchronized, instead of synchronizing the entire model after a fixed number of iterations. Leveraging this methodology, we introduce DreamDDP, a training framework that accelerates low-bandwidth distributed training with three key innovations: (1) partial Local SGD with theoretical assurances of convergence rates comparable to S-SGD; (2) overlapping parameter synchronization with computation without extra GPU memory occupation; (3) identifying and exploiting three properties to schedule communication and computation, reducing training time based on fine-grained profiling of layer-wise communication and computation times. Empirical evaluations conducted on 32 GPUs with prominent deep learning models, including ResNet-18, ResNet-50, GPT-2, and Llama-2, demonstrate that DreamDDP enhances the convergence properties of Local SGD (and Adam) and achieves speedups ranging from $1.49\times$ to $3.91\times$ over leading baseline methods.
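The mechanism described in the abstract, synchronizing only a subset of layers each iteration instead of the whole model after a fixed number of iterations, can be sketched in plain Python. This is a toy single-process simulation: `local_step`, `partial_sync`, and the round-robin schedule are assumed names for illustration, not the paper's API.

```python
# A minimal single-process sketch of layer-wise partial synchronization
# (the idea behind partial Local SGD), simulated with plain Python floats
# instead of real GPUs and collectives. Names and schedule are illustrative.

def local_step(params, grads, lr=0.25):
    # Each worker applies its own gradient locally, with no communication.
    return {name: params[name] - lr * grads[name] for name in params}

def partial_sync(workers, layers_to_sync):
    # Average only the scheduled layers across workers; the remaining
    # layers keep their divergent local values until their turn.
    n = len(workers)
    for layer in layers_to_sync:
        avg = sum(w[layer] for w in workers) / n
        for w in workers:
            w[layer] = avg

# Two workers, three "layers", with gradients that pull them apart.
workers = [{"l0": 1.0, "l1": 1.0, "l2": 1.0} for _ in range(2)]
grads = [{"l0": 1.0, "l1": 2.0, "l2": 3.0},
         {"l0": -1.0, "l1": -2.0, "l2": -3.0}]

schedule = [["l0"], ["l1"], ["l2"]]  # round-robin: one layer per iteration
for it in range(3):
    workers = [local_step(w, g) for w, g in zip(workers, grads)]
    partial_sync(workers, schedule[it % len(schedule)])
```

Because each iteration communicates only the scheduled layers, every synchronization message is small enough to be hidden behind that iteration's computation, which is the opportunity the paper's scheduler exploits.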
Problem

Research questions and friction points this paper is trying to address.

Communication is the main bottleneck of geo-distributed data parallel training in low-bandwidth environments
Local SGD's whole-model synchronization prevents overlapping communication with computation
Existing methods leave communication-computation overlap opportunities unexploited
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise decoupling model synchronization
Overlapping parameter synchronization with computation
Fine-grained profiling for communication and computation scheduling
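The scheduling idea in the last bullet can be illustrated with a simple greedy heuristic: given profiled per-layer communication times and an iteration's computation budget, pick layers whose synchronization fits under the budget so the communication is fully hidden. This is a hypothetical sketch under assumed names and numbers (`schedule_layers`, the largest-first rule), not the paper's actual algorithm.

```python
# Hypothetical greedy scheduler: choose which layers to synchronize this
# iteration so their total communication time fits under the profiled
# computation time, i.e. the communication can be fully overlapped.
# The largest-first greedy rule is an assumption for illustration only.

def schedule_layers(comm_times, compute_budget):
    chosen, used = [], 0.0
    for layer in sorted(comm_times, key=comm_times.get, reverse=True):
        if used + comm_times[layer] <= compute_budget:
            chosen.append(layer)
            used += comm_times[layer]
    return chosen

# Profiled per-layer all-reduce times (ms) and one iteration's compute budget.
comm_times = {"embed": 4.0, "block": 2.5, "head": 1.0}
print(schedule_layers(comm_times, 3.6))
```

Layers that do not fit this iteration simply wait for a later one, which matches the partial-synchronization design: no layer must be synchronized every iteration.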