🤖 AI Summary
To address low distributed training efficiency and poor strategy adaptability in cloud-edge-end (CEE) cross-domain training—caused by heterogeneous GPU resources and unstable wide-area networks—this paper proposes the first communication-centric architecture. Our approach integrates three key components: (1) an automatic grouping mechanism driven by heterogeneous device profiling, (2) zero-bubble compact pipeline parallelism, and (3) a runtime network fluctuation-aware dynamic reconfiguration regulator. Leveraging performance modeling, automated parallel strategy search, and real-time network adaptation, the system achieves 1.3–2.8× higher training throughput than state-of-the-art and mainstream baselines under realistic CEE settings. It significantly improves cross-domain resource utilization and training robustness, particularly under variable network conditions and diverse hardware configurations.
📝 Abstract
Most existing training systems focus on a single region. In contrast, we envision that cross-region training enables more flexible GPU resource allocation and holds significant potential. However, the hierarchical cluster topology and unstable networks in the cloud-edge-end (CEE) environment, a typical cross-region scenario, pose substantial challenges to building an efficient and autonomous model training system. We propose SpanTrain, a geo-distributed model training system tailored for heterogeneous GPUs and networks in CEE environments. SpanTrain adopts a communication-centric design philosophy to tackle the challenges arising from slow and unstable inter-region networks. It begins with a heterogeneous device profiler that identifies and groups devices based on both network and compute characteristics. Leveraging these device groups, SpanTrain implements compact, zero-bubble pipeline parallelism and automatically derives optimal parallel strategies. To further adapt to runtime variability, SpanTrain integrates a dynamic environment adapter that reacts to network fluctuations. Extensive evaluations demonstrate that SpanTrain achieves 1.3–2.8× higher training throughput than widely used and SOTA training systems.