🤖 AI Summary
To address low efficiency in large-scale language model training across geographically distributed data centers—caused by WAN bandwidth constraints and GPU idle bubbles—this paper proposes ATLAS, a time-aware WAN bandwidth sharing mechanism, and BubbleTEA, a co-scheduling technique that dynamically deploys prefill inference services during training idle periods. By integrating WAN-aware communication scheduling, a prefill-as-a-service architecture, and fine-grained GPU bubble detection and filling, the approach achieves resource-level synergy between training and inference. Experiments demonstrate up to 17× end-to-end training speedup, up to 94% GPU utilization, and substantial reductions in cross-DC communication overhead, waiting latency, and total cost. The core innovation lies in explicitly modeling training idle bubbles as schedulable service resources and establishing the first joint optimization framework for WAN bandwidth and time-varying computational resources.
📝 Abstract
The widespread adoption of language models (LMs) across multiple industries has caused a huge surge in demand for GPUs. Training LMs requires tens of thousands of GPUs, and housing them in the same datacenter (DC) is becoming challenging. We focus on training such models across multiple DCs connected via a wide-area network (WAN). We build ATLAS, which speeds up such training using novel temporal bandwidth sharing and many other design choices. While ATLAS improves the training time, it does not eliminate the bubbles (idle GPU cycles). We build BUBBLETEA, which runs prefill-as-a-service (part of LM inference) during the bubbles, improving GPU utilization substantially without any impact on training. Together, ATLAS and BUBBLETEA improve training time by up to 17X and achieve GPU utilization of up to 94%.
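To make the bubble-filling idea concrete, here is a minimal sketch (not the paper's actual implementation) of BUBBLETEA's core scheduling step: given known idle intervals in a training pipeline and duration estimates for pending prefill requests, greedily pack prefills into bubbles so that each prefill finishes before its bubble ends and training is never delayed. The function name, interval representation, and greedy policy are all illustrative assumptions.

```python
# Hypothetical sketch of bubble-filling: pack prefill requests into
# training idle intervals without ever delaying the training schedule.
# Not the paper's implementation; names and policy are assumptions.

def pack_prefills(bubbles, prefill_durations):
    """Greedily assign prefill tasks to idle GPU bubbles.

    bubbles: list of (start, end) idle intervals on a GPU.
    prefill_durations: estimated run time of each pending prefill request.
    Returns (schedule, leftover): schedule is a list of
    (task_index, start_time); leftover holds tasks that fit in no bubble.
    """
    schedule = []
    leftover = []
    # Mutable copy: slot[0] advances as tasks consume each bubble.
    free = [[s, e] for s, e in sorted(bubbles)]
    for i, d in enumerate(prefill_durations):
        placed = False
        for slot in free:
            if slot[1] - slot[0] >= d:   # task completes before bubble ends
                schedule.append((i, slot[0]))
                slot[0] += d             # shrink the remaining idle window
                placed = True
                break
        if not placed:
            leftover.append(i)           # defer to a later bubble / other GPU
    return schedule, leftover
```

For example, with bubbles `[(0, 4), (10, 13)]` and prefill durations `[3, 2, 2]`, task 0 runs at t=0, task 1 starts at t=10, and task 2 is left over because no remaining idle window can hold it.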