🤖 AI Summary
To address challenges in distributed ML training across dynamic, heterogeneous, and geographically dispersed clusters—including inefficient resource allocation, poor throughput-cost trade-offs, and insufficient heterogeneity support in existing frameworks—this paper proposes a full-stack automated training system. The system integrates lightweight runtime and memory simulation, adaptive search-space pruning, and a heterogeneity-aware training framework. Key contributions include: (1) a performance-modeling-guided efficient configuration search algorithm; (2) a heterogeneity-compatible distributed training runtime; (3) a cross-availability-zone resource scheduling mechanism; and (4) dynamic topology-aware communication optimization. Evaluated on real-world heterogeneous clusters, the system achieves 2.1× higher training throughput and reduces per-task cost by 37% compared to state-of-the-art baselines. It converges to optimal configurations within minutes and requires no manual configuration tuning.
📝 Abstract
The high GPU demand of ML training makes it hard to allocate large homogeneous clusters of high-end GPUs in a single availability zone. Leveraging heterogeneous GPUs available within and across zones can improve throughput at a reasonable cost. However, training ML models on heterogeneous resources introduces significant challenges, such as stragglers and a large search space of possible job configurations. Current systems lack support for efficiently training models on heterogeneous resources. We present Sailor, a system that automates distributed training over heterogeneous, geo-distributed, and dynamically available resources. Sailor combines an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework that supports different types of heterogeneity to optimize training throughput and cost.
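The core idea of simulator-guided configuration search, as described above, can be illustrated with a minimal sketch. This is not Sailor's actual algorithm or cost model; the GPU specs, the memory and throughput estimators, and all function names are hypothetical stand-ins showing how a lightweight memory simulator can prune infeasible configurations before a throughput model ranks the survivors:

```python
from itertools import product

# Hypothetical per-GPU specs: (memory in GB, relative compute speed, $/hour).
# Real systems would populate these from cloud provider data.
GPU_SPECS = {"A100": (80, 1.0, 4.0), "V100": (32, 0.45, 1.5)}

def estimate_memory_gb(params_billion, tp, pp):
    """Toy per-GPU footprint: ~16 bytes/param for fp16 weights, grads,
    and Adam optimizer states, sharded across tensor * pipeline stages."""
    return params_billion * 16 / (tp * pp)

def estimate_throughput(gpu, dp, tp, pp):
    """Toy throughput model: compute scales with world size and GPU speed,
    minus a communication penalty that grows with each parallelism degree."""
    _, speed, _ = GPU_SPECS[gpu]
    world = dp * tp * pp
    comm_penalty = 0.05 * (tp - 1) + 0.02 * (pp - 1) + 0.01 * (dp - 1)
    return world * speed * max(0.1, 1.0 - comm_penalty)

def search(params_billion, max_gpus, budget_per_hour):
    """Enumerate (gpu type, data/tensor/pipeline parallel) configurations,
    prune those the memory model says would OOM or exceed budget, and
    return the highest-estimated-throughput survivor."""
    best = None
    for gpu, dp, tp, pp in product(GPU_SPECS, [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4]):
        world = dp * tp * pp
        if world > max_gpus:
            continue
        mem_cap_gb, _, price = GPU_SPECS[gpu]
        # Memory-simulation pruning: skip configs that would not fit,
        # avoiding costly trial runs that crash with OOM.
        if estimate_memory_gb(params_billion, tp, pp) > mem_cap_gb:
            continue
        cost = world * price
        if cost > budget_per_hour:
            continue
        tput = estimate_throughput(gpu, dp, tp, pp)
        if best is None or tput > best[0]:
            best = (tput, gpu, dp, tp, pp, cost)
    return best

# Example: a 7B-parameter model, up to 16 GPUs, $64/hour budget.
best = search(7, 16, 64.0)
```

Because both estimators are cheap closed-form models rather than real training runs, the full search space can be swept in milliseconds; the actual system additionally handles geo-distribution and dynamic resource availability, which this sketch omits.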