Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges in distributed ML training across dynamic, heterogeneous, and geographically dispersed clusters—including inefficient resource allocation, poor throughput-cost trade-offs, and insufficient heterogeneity support in existing frameworks—this paper proposes a full-stack automated training system. The system integrates lightweight runtime/memory simulation, adaptive search-space pruning, and a heterogeneity-aware training framework. Key contributions include: (1) a performance-modeling–guided efficient configuration search algorithm; (2) a heterogeneity-compatible distributed training runtime; (3) a cross-Availability-Zone resource scheduling mechanism; and (4) dynamic topology-aware communication optimization. Evaluated on real-world heterogeneous clusters, the system achieves 2.1× higher training throughput and reduces per-task cost by 37% compared to state-of-the-art baselines. It converges to optimal configurations within minutes and requires zero manual hyperparameter tuning.

📝 Abstract
The high GPU demand of ML training makes it hard to allocate large homogeneous clusters of high-end GPUs in a single availability zone. Leveraging heterogeneous GPUs available within and across zones can improve throughput at a reasonable cost. However, training ML models on heterogeneous resources introduces significant challenges, such as stragglers and a large search space of possible job configurations. Current systems lack support for efficiently training models on heterogeneous resources. We present Sailor, a system that automates distributed training over heterogeneous, geo-distributed, and dynamically available resources. Sailor combines an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework that supports different types of heterogeneity to optimize training throughput and cost.
Problem

Research questions and friction points this paper is trying to address.

Automating distributed training on heterogeneous, geo-distributed clusters
Addressing stragglers and large job configuration search spaces
Optimizing training throughput and cost with dynamic resource availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates training on heterogeneous geo-distributed resources
Uses efficient search space exploration algorithm
Simulates runtime and memory footprint accurately
Foteini Strati
ETH Zurich
Computer Systems · Machine Learning · Cloud Computing
Zhendong Zhang
Xidian University
Machine Learning · Computer Vision · AI
George Manos
ETH Zurich, Switzerland
Ixeia Sánchez Périz
ETH Zurich, Switzerland
Qinghao Hu
MIT, USA
Tiancheng Chen
ETH Zurich, Switzerland
Berk Buzcu
HES-SO, Switzerland
Song Han
MIT, USA
Pamela Delgado
HES-SO, Switzerland
Ana Klimovic
ETH Zurich
Computer Systems · Cloud Computing · Computer Architecture