🤖 AI Summary
To address high communication overhead and low hardware utilization in large language model (LLM) training across geographically distributed, heterogeneous hardware, this paper proposes a hierarchical asynchronous optimization framework. It introduces a two-tier architecture comprising regional local parameter servers and a global parameter server, enabling asynchronous local SGD, cross-region model merging, and server-side update accumulation. The authors present the first asynchronous update mechanism incorporating hierarchical momentum and provide rigorous convergence guarantees for non-convex objectives. Experiments on geo-distributed LLM training demonstrate that the method achieves 7.5× faster convergence than synchronous SGD and 2.1× faster than existing asynchronous baselines, while matching or exceeding the model accuracy of fully synchronous SGD.
📝 Abstract
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD, matching or exceeding accuracy on standard language modeling and downstream benchmarks, while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.
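The two-tier flow described above (asynchronous local SGD at a regional server, accumulated deltas merged at a global server with momentum) can be sketched as follows. This is a minimal illustrative simulation, not the paper's algorithm: the class names (`LocalPS`, `GlobalPS`), the exact momentum form, and all hyperparameters are assumptions for exposition.

```python
# Hedged sketch of a HALoS-style two-tier asynchronous update.
# All names and hyperparameters below are illustrative assumptions.

class LocalPS:
    """Regional local parameter server: applies worker gradients
    asynchronously and accumulates the net update since the last
    global synchronization."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)   # region-local model copy
        self.anchor = list(params)   # snapshot at last global sync
        self.lr = lr

    def apply_gradient(self, grad):
        # Asynchronous local SGD step: no cross-region communication.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]

    def accumulated_update(self):
        # Net change since the last sync, to be sent to the global server.
        return [p - a for p, a in zip(self.params, self.anchor)]

    def sync(self, global_params):
        # Pull the merged global model and reset the anchor.
        self.params = list(global_params)
        self.anchor = list(global_params)


class GlobalPS:
    """Global parameter server: merges accumulated regional deltas
    using a momentum buffer (one illustrative form of hierarchical
    momentum; the paper's exact rule may differ)."""

    def __init__(self, params, beta=0.9, merge_lr=1.0):
        self.params = list(params)
        self.momentum = [0.0] * len(params)
        self.beta = beta
        self.merge_lr = merge_lr

    def merge(self, delta):
        # Momentum over server-side accumulated updates, then apply.
        self.momentum = [self.beta * m + d
                         for m, d in zip(self.momentum, delta)]
        self.params = [p + self.merge_lr * m
                       for p, m in zip(self.params, self.momentum)]
        return self.params


# Usage: one region takes a local step, then merges into the global model.
lps = LocalPS([1.0], lr=0.1)
gps = GlobalPS([1.0], beta=0.9)

lps.apply_gradient([2.0])            # local params: 1.0 - 0.1*2.0 = 0.8
delta = lps.accumulated_update()     # [-0.2]
merged = gps.merge(delta)            # global params: 1.0 + (-0.2) = 0.8
lps.sync(merged)                     # region pulls the merged model
```

Because regions only ship accumulated deltas at merge time, expensive inter-region traffic scales with the sync frequency rather than with every worker step, which is the mechanism the abstract credits for reducing communication cost and straggler effects.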