🤖 AI Summary
To address challenges in distributed ML training across dynamic, heterogeneous, and geographically dispersed clusters—including inefficient resource allocation, poor throughput-cost trade-offs, and insufficient heterogeneity support in existing frameworks—this paper proposes a full-stack automated training system. The system integrates lightweight runtime and memory simulation, adaptive search-space pruning, and a heterogeneity-aware training framework. Key contributions include: (1) a performance-modeling-guided efficient configuration search algorithm; (2) a heterogeneity-compatible distributed training runtime; (3) a cross-availability-zone resource scheduling mechanism; and (4) dynamic topology-aware communication optimization. Evaluated on real-world heterogeneous clusters, the system achieves 2.1× higher training throughput and reduces per-task cost by 37% compared to state-of-the-art baselines. It converges to optimal configurations within minutes and requires no manual configuration tuning.
📝 Abstract
The high GPU demand of ML training makes it hard to allocate large homogeneous clusters of high-end GPUs in a single availability zone. Leveraging heterogeneous GPUs available within and across zones can improve throughput at a reasonable cost. However, training ML models on heterogeneous resources introduces significant challenges, such as stragglers and a large search space of possible job configurations. Current systems lack support for efficiently training models on heterogeneous resources. We present Sailor, a system that automates distributed training over heterogeneous, geo-distributed, and dynamically available resources. Sailor combines an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework that supports different types of heterogeneity to optimize training throughput and cost.
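The core idea of simulator-guided configuration search, as described above, can be illustrated with a minimal sketch. This is not Sailor's actual algorithm or cost model; the GPU specs, the memory and throughput estimators, and all function names are hypothetical stand-ins showing how a lightweight memory simulator can prune infeasible configurations before a throughput model ranks the survivors:

```python
from itertools import product

# Hypothetical per-GPU specs: (memory in GB, relative compute speed, $/hour).
# Real systems would populate these from cloud provider data.
GPU_SPECS = {"A100": (80, 1.0, 4.0), "V100": (32, 0.45, 1.5)}

def estimate_memory_gb(params_billion, tp, pp):
    """Toy per-GPU footprint: ~16 bytes/param for fp16 weights, grads,
    and Adam optimizer states, sharded across tensor * pipeline stages."""
    return params_billion * 16 / (tp * pp)

def estimate_throughput(gpu, dp, tp, pp):
    """Toy throughput model: compute scales with world size and GPU speed,
    minus a communication penalty that grows with each parallelism degree."""
    _, speed, _ = GPU_SPECS[gpu]
    world = dp * tp * pp
    comm_penalty = 0.05 * (tp - 1) + 0.02 * (pp - 1) + 0.01 * (dp - 1)
    return world * speed * max(0.1, 1.0 - comm_penalty)

def search(params_billion, max_gpus, budget_per_hour):
    """Enumerate (gpu type, data/tensor/pipeline parallel) configurations,
    prune those the memory model says would OOM or exceed budget, and
    return the highest-estimated-throughput survivor."""
    best = None
    for gpu, dp, tp, pp in product(GPU_SPECS, [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4]):
        world = dp * tp * pp
        if world > max_gpus:
            continue
        mem_cap_gb, _, price = GPU_SPECS[gpu]
        # Memory-simulation pruning: skip configs that would not fit,
        # avoiding costly trial runs that crash with OOM.
        if estimate_memory_gb(params_billion, tp, pp) > mem_cap_gb:
            continue
        cost = world * price
        if cost > budget_per_hour:
            continue
        tput = estimate_throughput(gpu, dp, tp, pp)
        if best is None or tput > best[0]:
            best = (tput, gpu, dp, tp, pp, cost)
    return best

# Example: a 7B-parameter model, up to 16 GPUs, $64/hour budget.
best = search(7, 16, 64.0)
```

Because both estimators are cheap closed-form models rather than real training runs, the full search space can be swept in milliseconds; the actual system additionally handles geo-distribution and dynamic resource availability, which this sketch omits.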