🤖 AI Summary
Large-model training on heterogeneous GPU clusters must balance performance against monetary cost, yet existing auto-parallel frameworks neglect explicit cost modeling.
Method: We propose the first auto-parallel strategy search framework that explicitly models and jointly optimizes monetary cost. It integrates high-fidelity dual-objective mathematical models of training time and total cost, combinatorial-optimization search, lightweight performance prediction, and joint tuning of multi-dimensional parameters (GPU model, GPU count, and parallelism configuration).
Contribution/Results: This work pioneers incorporating both hardware acquisition and operational costs into auto-parallel search. With strategy-prediction accuracy above 95%, it achieves fast search times of 1.27 seconds on average in single-GPU settings and under 1.35 minutes in heterogeneous-GPU settings, while outperforming expert-designed manual strategies in throughput and significantly reducing end-to-end training cost.
📝 Abstract
In this paper, we introduce Astra, an efficient and money-saving automatic parallel strategy search framework for heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy over both the GPU configuration search space (GPU types and counts) and the parallel parameter search space. Second, Astra supports heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first to incorporate monetary cost into automatic parallel strategy search. The experimental results demonstrate that Astra achieves better throughput than expert-designed strategies. Its search time is limited to 1.27 seconds in single-GPU settings and less than 1.35 minutes in heterogeneous-GPU settings on average, with an accuracy of over 95%.
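To make the joint search concrete, here is a minimal sketch of a dual-objective search over GPU type, GPU count, and parallelism degrees that picks the cheapest feasible configuration. The GPU prices, speed factors, overhead coefficients, and the analytical time model are illustrative assumptions for exposition, not Astra's actual cost or performance models.

```python
# Hypothetical sketch of a cost-aware strategy search.
# GPU prices, speeds, and the overhead model below are made-up
# assumptions, not values from the Astra paper.
from itertools import product

# Assumed per-GPU hourly price and relative per-GPU throughput.
GPU_SPECS = {
    "A100": {"price_per_hour": 4.0, "speed": 1.0},
    "V100": {"price_per_hour": 2.0, "speed": 0.45},
}

def estimate_time_hours(total_work, gpu_type, gpu_count, tp, pp):
    """Toy analytical time model: work over aggregate throughput,
    inflated by a simple parallelization-overhead factor."""
    throughput = GPU_SPECS[gpu_type]["speed"] * gpu_count
    overhead = 1.0 + 0.05 * (tp - 1) + 0.03 * (pp - 1)
    return total_work * overhead / throughput

def estimate_cost(gpu_type, gpu_count, hours):
    """Total monetary cost: price per GPU-hour times GPU-hours used."""
    return GPU_SPECS[gpu_type]["price_per_hour"] * gpu_count * hours

def search(total_work, max_gpus=16):
    """Enumerate (GPU type, count, tensor/pipeline parallelism) and
    return the cheapest valid strategy as a tuple."""
    best = None
    for gpu_type, n in product(GPU_SPECS, range(1, max_gpus + 1)):
        for tp, pp in product((1, 2, 4), repeat=2):
            dp, rem = divmod(n, tp * pp)
            if dp == 0 or rem:
                continue  # parallel degrees must exactly tile the cluster
            hours = estimate_time_hours(total_work, gpu_type, n, tp, pp)
            cost = estimate_cost(gpu_type, n, hours)
            if best is None or cost < best[0]:
                best = (cost, hours, gpu_type, n, tp, pp, dp)
    return best
```

In a real system the two analytical estimators would be replaced by the high-fidelity time and cost models plus lightweight performance prediction, and the exhaustive loop by combinatorial optimization; the structure of the joint (GPU configuration, parallelism) search is the point of the sketch.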