🤖 AI Summary
To address low efficiency in large-scale language model training across geographically distributed data centers—caused by WAN bandwidth constraints and GPU idle bubbles—this paper proposes ATLAS, a time-aware WAN bandwidth sharing mechanism, and BubbleTEA, a co-scheduling technique that dynamically deploys prefill inference services during training idle periods. By integrating WAN-aware communication scheduling, a prefill-as-a-service architecture, and fine-grained GPU bubble detection and filling, the approach achieves resource-level synergy between training and inference. Experiments demonstrate up to 17× end-to-end training speedup, up to 94% GPU utilization, and substantial reductions in cross-DC communication overhead, waiting latency, and total cost. The core innovation lies in explicitly modeling training idle bubbles as schedulable service resources and establishing the first joint optimization framework for WAN bandwidth and time-varying computational resources.
📝 Abstract
The widespread adoption of language models (LMs) across multiple industries has caused a huge surge in demand for GPUs. Training LMs requires tens of thousands of GPUs, and housing them in the same datacenter (DC) is becoming challenging. We focus on training such models across multiple DCs connected via a wide-area network (WAN). We build ATLAS, which speeds up such training using novel temporal bandwidth sharing and many other design choices. While ATLAS improves the training time, it does not eliminate the bubbles (idle GPU cycles). We build BUBBLETEA, which runs prefill-as-a-service (part of LM inference) during the bubbles, improving GPU utilization substantially without any impact on training. Together, ATLAS and BUBBLETEA improve training time by up to 17X and achieve GPU utilization of up to 94%.
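To make the bubble-filling idea concrete, here is a minimal sketch (not the paper's actual implementation) of BUBBLETEA's core scheduling step: given known idle intervals in a training pipeline and duration estimates for pending prefill requests, greedily pack prefills into bubbles so that each prefill finishes before its bubble ends and training is never delayed. The function name, interval representation, and greedy policy are all illustrative assumptions.

```python
# Hypothetical sketch of bubble-filling: pack prefill requests into
# training idle intervals without ever delaying the training schedule.
# Not the paper's implementation; names and policy are assumptions.

def pack_prefills(bubbles, prefill_durations):
    """Greedily assign prefill tasks to idle GPU bubbles.

    bubbles: list of (start, end) idle intervals on a GPU.
    prefill_durations: estimated run time of each pending prefill request.
    Returns (schedule, leftover): schedule is a list of
    (task_index, start_time); leftover holds tasks that fit in no bubble.
    """
    schedule = []
    leftover = []
    # Mutable copy: slot[0] advances as tasks consume each bubble.
    free = [[s, e] for s, e in sorted(bubbles)]
    for i, d in enumerate(prefill_durations):
        placed = False
        for slot in free:
            if slot[1] - slot[0] >= d:   # task completes before bubble ends
                schedule.append((i, slot[0]))
                slot[0] += d             # shrink the remaining idle window
                placed = True
                break
        if not placed:
            leftover.append(i)           # defer to a later bubble / other GPU
    return schedule, leftover
```

For example, with bubbles `[(0, 4), (10, 13)]` and prefill durations `[3, 2, 2]`, task 0 runs at t=0, task 1 starts at t=10, and task 2 is left over because no remaining idle window can hold it.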