🤖 AI Summary
Large-model training on heterogeneous GPU clusters must balance performance against monetary cost, yet existing auto-parallel frameworks neglect explicit cost modeling.
Method: We propose the first auto-parallel strategy search framework that explicitly models and jointly optimizes monetary cost. It integrates high-fidelity dual-objective mathematical models of training time and total cost, combinatorial-optimization search, lightweight performance prediction, and joint tuning of multi-dimensional parameters (GPU model, GPU count, and parallelism configuration).
Contribution/Results: This work pioneers incorporating both hardware acquisition and operational costs into auto-parallel search. With strategy-prediction accuracy above 95%, it achieves fast search times of 1.27 seconds on average in single-GPU settings and under 1.35 minutes in heterogeneous-GPU settings, while outperforming expert-designed manual strategies in throughput and significantly reducing end-to-end training cost.
📝 Abstract
In this paper, we introduce Astra, an efficient and money-saving automatic parallel strategy search framework for heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy over both the GPU configuration search space (GPU types and counts) and the parallel parameter search space. Second, Astra supports heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first to incorporate monetary cost into automatic parallel strategy search. The experimental results demonstrate that Astra achieves better throughput than expert-designed strategies. Its search time is limited to 1.27 seconds in single-GPU settings and less than 1.35 minutes in heterogeneous-GPU settings on average, with an accuracy of over 95%.
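To make the joint search concrete, here is a minimal sketch of a dual-objective search over GPU type, GPU count, and parallelism degrees that picks the cheapest feasible configuration. The GPU prices, speed factors, overhead coefficients, and the analytical time model are illustrative assumptions for exposition, not Astra's actual cost or performance models.

```python
# Hypothetical sketch of a cost-aware strategy search.
# GPU prices, speeds, and the overhead model below are made-up
# assumptions, not values from the Astra paper.
from itertools import product

# Assumed per-GPU hourly price and relative per-GPU throughput.
GPU_SPECS = {
    "A100": {"price_per_hour": 4.0, "speed": 1.0},
    "V100": {"price_per_hour": 2.0, "speed": 0.45},
}

def estimate_time_hours(total_work, gpu_type, gpu_count, tp, pp):
    """Toy analytical time model: work over aggregate throughput,
    inflated by a simple parallelization-overhead factor."""
    throughput = GPU_SPECS[gpu_type]["speed"] * gpu_count
    overhead = 1.0 + 0.05 * (tp - 1) + 0.03 * (pp - 1)
    return total_work * overhead / throughput

def estimate_cost(gpu_type, gpu_count, hours):
    """Total monetary cost: price per GPU-hour times GPU-hours used."""
    return GPU_SPECS[gpu_type]["price_per_hour"] * gpu_count * hours

def search(total_work, max_gpus=16):
    """Enumerate (GPU type, count, tensor/pipeline parallelism) and
    return the cheapest valid strategy as a tuple."""
    best = None
    for gpu_type, n in product(GPU_SPECS, range(1, max_gpus + 1)):
        for tp, pp in product((1, 2, 4), repeat=2):
            dp, rem = divmod(n, tp * pp)
            if dp == 0 or rem:
                continue  # parallel degrees must exactly tile the cluster
            hours = estimate_time_hours(total_work, gpu_type, n, tp, pp)
            cost = estimate_cost(gpu_type, n, hours)
            if best is None or cost < best[0]:
                best = (cost, hours, gpu_type, n, tp, pp, dp)
    return best
```

In a real system the two analytical estimators would be replaced by the high-fidelity time and cost models plus lightweight performance prediction, and the exhaustive loop by combinatorial optimization; the structure of the joint (GPU configuration, parallelism) search is the point of the sketch.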