🤖 AI Summary
To address low distributed training efficiency and poor strategy adaptability in cloud-edge-end (CEE) cross-domain training—caused by heterogeneous GPU resources and unstable wide-area networks—this paper proposes the first communication-centric architecture. Our approach integrates three key components: (1) an automatic grouping mechanism driven by heterogeneous device profiling, (2) zero-bubble compact pipeline parallelism, and (3) a runtime network fluctuation-aware dynamic reconfiguration regulator. Leveraging performance modeling, automated parallel strategy search, and real-time network adaptation, the system achieves 1.3–2.8× higher training throughput than state-of-the-art and mainstream baselines under realistic CEE settings. It significantly improves cross-domain resource utilization and training robustness, particularly under variable network conditions and diverse hardware configurations.
📝 Abstract
Most existing training systems focus on a single region. In contrast, we envision that cross-region training enables more flexible GPU resource allocation and holds significant potential. However, the hierarchical cluster topology and unstable networks in the cloud-edge-end (CEE) environment, a typical cross-region scenario, pose substantial challenges to building an efficient and autonomous model training system. We propose SpanTrain, a geo-distributed model training system tailored for heterogeneous GPUs and networks in CEE environments. SpanTrain adopts a communication-centric design philosophy to tackle the challenges arising from slow and unstable inter-region networks. It begins with a heterogeneous device profiler that identifies and groups devices based on both network and compute characteristics. Leveraging these device groups, SpanTrain implements compact, zero-bubble pipeline parallelism and automatically derives optimal parallel strategies. To further adapt to runtime variability, SpanTrain integrates a dynamic environment adapter that reacts to network fluctuations. Extensive evaluations demonstrate that SpanTrain achieves 1.3–2.8× higher training throughput than widely used and SOTA training systems.