SpanTrain: Highly Efficient Cross-domain Model Distributed Training System under Heterogeneous GPUs and Networks in CEE Environment

📅 2025-05-21
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address the low distributed training efficiency and poor strategy adaptability of cloud-edge-end (CEE) cross-domain training, caused by heterogeneous GPU resources and unstable wide-area networks, this paper proposes SpanTrain, which it presents as the first communication-centric architecture for this setting. The approach integrates three key components: (1) an automatic grouping mechanism driven by heterogeneous device profiling, (2) compact zero-bubble pipeline parallelism, and (3) a runtime regulator that detects network fluctuations and dynamically reconfigures the system. Leveraging performance modeling, automated parallel-strategy search, and real-time network adaptation, SpanTrain achieves 1.3–2.8× higher training throughput than mainstream and state-of-the-art baselines under realistic CEE settings, and significantly improves cross-domain resource utilization and training robustness under variable network conditions and diverse hardware configurations.
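
As a concrete picture of component (1), here is a minimal sketch of profiler-driven grouping: a device joins a group only if every pairwise link to the group's members clears an intra-group bandwidth threshold, so slow wide-area links fall on group boundaries. This is a hypothetical illustration, not SpanTrain's code; `Device`, `group_devices`, and the 10 Gbps threshold are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    tflops: float                                   # measured compute throughput
    bandwidth: dict = field(default_factory=dict)   # peer name -> measured Gbps

def group_devices(devices, intra_group_gbps=10.0):
    """Greedy grouping: a device joins an existing group only if every link
    to that group's members meets the intra-group bandwidth threshold."""
    groups = []
    for dev in sorted(devices, key=lambda d: -d.tflops):
        for group in groups:
            if all(dev.bandwidth.get(m.name, 0.0) >= intra_group_gbps for m in group):
                group.append(dev)
                break
        else:
            groups.append([dev])   # no compatible group: start a new one
    return groups

if __name__ == "__main__":
    devs = [
        Device("cloud-a100-0", 312.0, {"cloud-a100-1": 100.0, "edge-3090": 0.5}),
        Device("cloud-a100-1", 312.0, {"cloud-a100-0": 100.0, "edge-3090": 0.5}),
        Device("edge-3090", 71.0, {"cloud-a100-0": 0.5, "cloud-a100-1": 0.5}),
    ]
    for i, g in enumerate(group_devices(devs)):
        print(f"group {i}: {[d.name for d in g]}")
```

Grouping on both compute and bandwidth matters because pipeline stages are later mapped onto groups: frequent traffic stays on fast intra-group links, and only inter-stage activations cross the slow wide-area links.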

📝 Abstract
Most existing training systems focus on a single region. In contrast, we envision that cross-region training offers more flexible GPU resource allocation and yields significant potential. However, the hierarchical cluster topology and unstable networks in the cloud-edge-end (CEE) environment, a typical cross-region scenario, pose substantial challenges to building an efficient and autonomous model training system. We propose SpanTrain, a geo-distributed model training system tailored for heterogeneous GPUs and networks in CEE environments. SpanTrain adopts a communication-centric design philosophy to tackle challenges arising from slow and unstable inter-region networks. It begins with a heterogeneous device profiler that identifies and groups devices based on both network and compute characteristics. Leveraging device groups, SpanTrain implements compact, zero-bubble pipeline parallelism, automatically deriving optimal parallel strategies. To further adapt to runtime variability, SpanTrain integrates a dynamic environment adapter that reacts to network fluctuations. Extensive evaluations demonstrate that SpanTrain achieves 1.3–2.8× higher training throughput compared to widely used and state-of-the-art (SOTA) training systems.
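
To see why "zero-bubble" matters: under classic 1F1B scheduling with p pipeline stages and m microbatches, each stage idles for roughly (p − 1)/(m + p − 1) of a step. Zero-bubble schedules in the literature split the backward pass into an activation-gradient part (on the critical path) and a deferrable weight-gradient part, using the deferred work to fill the idle slots. The toy calculation below illustrates that effect under strong simplifying assumptions (uniform slot times, a fixed deferrable fraction); it is not SpanTrain's scheduler.

```python
def bubble_ratio_1f1b(p, m):
    """Idle fraction of a training step under classic 1F1B scheduling."""
    return (p - 1) / (m + p - 1)

def bubble_ratio_zero_bubble(p, m, w_fraction=0.5):
    """Toy model: if the deferrable weight-gradient work per stage
    (m * w_fraction slots) covers the (p - 1) idle slots, the bubble
    disappears; otherwise it only shrinks. Assumes uniform slot times."""
    leftover = max(0.0, (p - 1) - m * w_fraction)
    return leftover / (m + p - 1)

for p, m in [(4, 8), (4, 32), (8, 32)]:
    print(f"p={p} m={m}  1F1B bubble={bubble_ratio_1f1b(p, m):.1%}  "
          f"zero-bubble={bubble_ratio_zero_bubble(p, m):.1%}")
```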
Problem

Research questions and friction points this paper is trying to address.

Efficient cross-region model training under heterogeneous GPUs and networks
Autonomous system design for unstable cloud-edge-end environments
Optimizing parallel strategies for dynamic network and compute conditions (a toy cost-model sketch follows this list)
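
To make the strategy-optimization problem above concrete, a minimal cost-model sketch: size each pipeline stage in proportion to its group's measured compute, then score the candidate by its slowest stage and inter-stage transfer times. All functions and constants here are assumptions for illustration, not the paper's formulation.

```python
def partition_layers(num_layers, group_tflops):
    """Assign layers to pipeline stages in proportion to group compute."""
    total = sum(group_tflops)
    shares = [max(1, round(num_layers * t / total)) for t in group_tflops]
    while sum(shares) > num_layers:                 # repair rounding drift
        shares[shares.index(max(shares))] -= 1
    while sum(shares) < num_layers:
        shares[shares.index(min(shares))] += 1
    return shares

def microbatch_time(shares, group_tflops, layer_tflop, link_gbps, act_gb):
    """Steady-state throughput is bounded by the slowest stage; activation
    transfers over inter-stage links add pipeline-fill latency."""
    stage_t = [s * layer_tflop / t for s, t in zip(shares, group_tflops)]
    comm_t = [act_gb * 8 / b for b in link_gbps]    # GB -> Gbit over Gbps links
    return max(stage_t), sum(stage_t) + sum(comm_t)

if __name__ == "__main__":
    tflops = [312.0, 312.0, 71.0]                   # two cloud groups, one edge group
    shares = partition_layers(24, tflops)           # -> [11, 11, 2]
    bottleneck, fill = microbatch_time(shares, tflops, 1.5, [100.0, 0.5], 0.2)
    print(shares, f"bottleneck={bottleneck * 1e3:.1f}ms fill={fill:.2f}s")
```

A real search would additionally enumerate stage orderings and data/tensor-parallel widths, keeping the candidate with the lowest modeled step time.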
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous device profiler groups by network and compute
Compact zero-bubble pipeline parallelism for efficiency
Dynamic adapter reacts to network fluctuations automatically (sketched after this list)
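
For the dynamic adapter, one plausible shape (an assumption-level sketch, not SpanTrain's implementation) is a regulator that smooths bandwidth samples with an exponential moving average and triggers a strategy re-derivation only after sustained drift, so transient WAN jitter does not cause re-planning thrash:

```python
class FluctuationAdapter:
    """Flags sustained bandwidth drift; all thresholds are illustrative."""

    def __init__(self, baseline_gbps, alpha=0.2, tolerance=0.3, patience=5):
        self.baseline = baseline_gbps   # bandwidth the current strategy assumes
        self.alpha = alpha              # EMA smoothing factor
        self.tolerance = tolerance      # relative deviation counted as drift
        self.patience = patience        # consecutive drifting samples required
        self.ema = baseline_gbps
        self.drift_count = 0

    def observe(self, sample_gbps):
        """Feed one bandwidth sample; returns True when the caller should
        re-derive the parallel strategy against the new steady state."""
        self.ema = self.alpha * sample_gbps + (1 - self.alpha) * self.ema
        drifting = abs(self.ema - self.baseline) / self.baseline > self.tolerance
        self.drift_count = self.drift_count + 1 if drifting else 0
        if self.drift_count >= self.patience:
            self.baseline = self.ema    # accept the new steady state
            self.drift_count = 0
            return True
        return False

adapter = FluctuationAdapter(baseline_gbps=1.0)
for bw in [1.0, 0.9, 0.3, 0.25, 0.3, 0.28, 0.3, 0.27, 0.3]:
    if adapter.observe(bw):
        print(f"sustained drift to ~{adapter.baseline:.2f} Gbps: re-plan")
```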
Authors
Jinquan Wang, Beihang University, Haidian, Beijing, China
Xiaojian Liao, Beihang University (Storage System, AI System)
Xuzhao Liu, Beihang University, Haidian, Beijing, China
Jiashun Suo, Beihang University, Haidian, Beijing, China
Zhisheng Huo, Beihang University, Haidian, Beijing, China
Chenhao Zhang, Beihang University, Haidian, Beijing, China
Xiangrong Xu, Beihang University, Haidian, Beijing, China
Runnan Shen, Beihang University, Haidian, Beijing, China
Xilong Xie, Beihang University, Haidian, Beijing, China
Limin Xiao, FDU (Fiber Optics, Optoelectronics)