Improving training time and GPU utilization in geo-distributed language model training

📅 2024-11-16
🏛️ arXiv.org
📈 Citations: 4
Influential: 2
📄 PDF
🤖 AI Summary
To address low efficiency in large-scale language model training across geographically distributed data centers—caused by WAN bandwidth constraints and GPU idle bubbles—this paper proposes ATLAS, a time-aware WAN bandwidth sharing mechanism, and BUBBLETEA, a co-scheduling technique that dynamically deploys prefill inference services during training idle periods. By integrating WAN-aware communication scheduling, a prefill-as-a-service architecture, and fine-grained GPU bubble detection and filling, the approach achieves resource-level synergy between training and inference. Experiments demonstrate up to 17× end-to-end training speedup, GPU utilization of up to 94%, and substantial reductions in cross-DC communication overhead, waiting latency, and total cost. The core innovation lies in explicitly modeling training idle bubbles as schedulable service resources and establishing a joint optimization framework for WAN bandwidth and time-varying computational resources.
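The bubble-filling idea in the summary above can be illustrated with a minimal greedy packing sketch: prefill requests are placed into detected idle bubbles only if they fit entirely, so training is never delayed. All names, durations, and the greedy policy here are invented for illustration and are not the paper's actual BUBBLETEA algorithm.

```python
def fill_bubbles(bubbles, prefill_jobs):
    """Greedily pack prefill jobs into idle pipeline bubbles.

    bubbles: list of idle durations (ms) per detected bubble.
    prefill_jobs: list of (job_id, estimated_cost_ms) tuples.
    A job is scheduled in a bubble only if it fits within the
    remaining idle time, so training progress is unaffected.
    """
    schedule = {i: [] for i in range(len(bubbles))}
    remaining = list(bubbles)  # idle milliseconds left per bubble
    # Place longest jobs first to reduce fragmentation of idle time.
    for job_id, cost in sorted(prefill_jobs, key=lambda j: -j[1]):
        for i, slack in enumerate(remaining):
            if cost <= slack:
                schedule[i].append(job_id)
                remaining[i] -= cost
                break  # job placed; otherwise it waits for later bubbles
    return schedule, remaining

# Toy example: three pipeline bubbles and four prefill requests.
bubbles = [120, 40, 80]                                   # idle ms per bubble
jobs = [("q0", 70), ("q1", 35), ("q2", 60), ("q3", 100)]  # (id, cost ms)
schedule, leftover = fill_bubbles(bubbles, jobs)
```

In this toy run, the 100 ms job lands in the largest bubble and the 60 ms job cannot be placed at all, which mirrors the constraint that inference work must never spill past a bubble's boundary into training time.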

📝 Abstract
The widespread adoption of language models (LMs) across multiple industries has caused a huge surge in demand for GPUs. Training LMs requires tens of thousands of GPUs, and housing them all in the same datacenter (DC) is becoming challenging. We focus on training such models across multiple DCs connected via a Wide-Area Network (WAN). We build ATLAS, which speeds up training using novel temporal bandwidth sharing and many other design choices. While ATLAS improves the training time, it does not eliminate the bubbles (idle GPU cycles). We built BUBBLETEA, which runs prefill-as-a-service (part of LM inference) during the bubbles, improving GPU utilization substantially without any impact on training. Together, ATLAS and BUBBLETEA improve training time by up to 17X and achieve GPU utilization of up to 94%.
Problem

Research questions and friction points this paper is trying to address.

Speeding up geo-distributed language model training time
Improving GPU utilization during idle training cycles
Filling idle GPU bubbles in multi-datacenter LM training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Atlas enables geo-distributed training via bandwidth sharing
BubbleTea runs inference service during GPU idle cycles
Combined system achieves faster training and high utilization
Palak Lnu
Microsoft Research India
Rohan Gandhi
Purdue University, Carnegie Mellon University, Microsoft Research
Karan Tandon
Microsoft Research India
Debopam Bhattacherjee
Microsoft Research India
Venkata N. Padmanabhan
Microsoft Research India