HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high communication overhead and low hardware utilization in large language model (LLM) training across geographically distributed, heterogeneous hardware, this paper proposes HALoS, a hierarchical asynchronous optimization framework. It introduces a two-tier architecture comprising regional local parameter servers and a global parameter server, enabling asynchronous local SGD, cross-region model merging, and server-side update accumulation. The paper presents the first asynchronous update mechanism incorporating hierarchical momentum and provides rigorous convergence guarantees for non-convex objectives. Experiments on geo-distributed LLM training demonstrate that HALoS achieves up to 7.5× faster convergence than synchronous SGD and up to 2.1× faster than existing asynchronous baselines, while matching or exceeding the model accuracy of fully synchronous SGD.
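The two-tier flow described above can be sketched in code. This is a minimal illustrative model, not the paper's implementation: the class names, the local/global learning rates, and the exact server-side momentum rule are assumptions chosen to mirror the summary's description (workers push asynchronous deltas to a regional local parameter server, which accumulates them and periodically flushes an aggregated update to the global parameter server, which merges it with hierarchical momentum).

```python
import numpy as np

DIM = 4  # toy model size

class LocalParameterServer:
    """Regional LPS (hypothetical sketch): absorbs asynchronous worker
    deltas immediately and accumulates them for the next global flush."""
    def __init__(self, global_params, local_lr=1.0):
        self.params = global_params.copy()
        self.accum = np.zeros_like(global_params)
        self.local_lr = local_lr

    def apply_worker_delta(self, delta):
        # Asynchronous local step: the regional model advances without
        # waiting for other regions (avoids cross-region stragglers).
        self.params += self.local_lr * delta
        self.accum += self.local_lr * delta

    def flush(self):
        # Hand the accumulated regional update to the GPS, then reset.
        out, self.accum = self.accum, np.zeros_like(self.accum)
        return out

class GlobalParameterServer:
    """GPS (hypothetical sketch): merges regional updates using
    server-side momentum; beta/lr values are illustrative."""
    def __init__(self, dim, lr=0.5, beta=0.9):
        self.params = np.zeros(dim)
        self.momentum = np.zeros(dim)
        self.lr, self.beta = lr, beta

    def merge(self, regional_update):
        # Assumed form of the hierarchical-momentum merge.
        self.momentum = self.beta * self.momentum + regional_update
        self.params += self.lr * self.momentum
        return self.params.copy()

gps = GlobalParameterServer(DIM)
lps = LocalParameterServer(gps.params)
for _ in range(3):                       # three async local steps in one region
    lps.apply_worker_delta(np.full(DIM, 0.1))
merged = gps.merge(lps.flush())          # one cheap cross-region merge
```

Note the communication pattern this structure implies: worker-to-LPS traffic stays on fast intra-region links and happens every step, while the expensive LPS-to-GPS exchange happens only once per flush.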

📝 Abstract
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD, matching or exceeding accuracy on standard language modeling and downstream benchmarks, while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.
Problem

Research questions and friction points this paper is trying to address.

High communication costs in geo-distributed LLM training
Straggler effects in heterogeneous hardware environments
Preserving model quality while lowering total training time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical local and global parameter servers
Minimizes inter-region communication costs
Leverages fast intra-region links efficiently