HAPT: Heterogeneity-Aware Automated Parallel Training on Heterogeneous Clusters

📅 2025-09-29
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address low resource utilization in distributed training on heterogeneous GPU clusters, this paper proposes an automated parallel training framework. Methodologically, it introduces: (1) a fine-grained inter-operator parallelism planner that enables hardware-aware, customized operator partitioning for heterogeneous devices; (2) a heterogeneity-aware 1F1B (one-forward-one-backward) scheduling mechanism that dynamically reorders micro-batch execution to maximize computation-communication overlap; and (3) an integrated optimization combining load balancing, communication cost modeling, and cross-cluster memory/bandwidth adaptation. Experimental evaluation on real-world heterogeneous GPU clusters demonstrates that the framework achieves a 1.3-1.6x speedup over state-of-the-art systems, including PyTorch DDP and DeepSpeed, while significantly alleviating communication bottlenecks and improving end-to-end training throughput and hardware resource utilization.
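The heterogeneity-aware 1F1B idea can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not Hapt's published algorithm: it inflates the warmup phase of vanilla 1F1B on stages that sit behind slow cross-cluster links, so activation sends overlap with subsequent forward compute, subject to an activation-memory cap. The function name and all timing numbers are hypothetical.

```python
# Illustrative sketch (not Hapt's actual algorithm): heterogeneity-aware
# warmup sizing for 1F1B. In vanilla 1F1B, stage i of S stages runs
# S - i - 1 warmup forwards; here we hypothetically add extra warmup
# forwards on stages whose downstream link is slow, so activation sends
# hide behind compute, capped by the activation memory budget.
from math import ceil

def warmup_microbatches(stage_id: int,
                        num_stages: int,
                        fwd_time_ms: float,       # this stage's forward time per micro-batch
                        send_time_ms: float,      # activation transfer time to next stage
                        max_inflight: int) -> int:  # memory cap on stashed activations
    base = num_stages - stage_id - 1              # vanilla 1F1B warmup count
    if stage_id == num_stages - 1:
        return 0                                  # last stage sends nothing downstream
    # Extra warmup forwards hide the residual send latency behind compute.
    extra = ceil(max(send_time_ms - fwd_time_ms, 0.0) / fwd_time_ms)
    return min(base + extra, max_inflight)

if __name__ == "__main__":
    # Fast intra-cluster link vs. slow cross-cluster link (assumed numbers).
    print(warmup_microbatches(0, 4, fwd_time_ms=10, send_time_ms=4,  max_inflight=8))  # 3
    print(warmup_microbatches(1, 4, fwd_time_ms=10, send_time_ms=35, max_inflight=8))  # 5
```

The trade-off mirrors the abstract's claim: a few more in-flight micro-batches buy communication overlap at the cost of a bounded amount of extra stashed activations.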

📝 Abstract
With the rapid evolution of GPU architectures, the heterogeneity of model training infrastructures is steadily increasing. In such environments, effectively utilizing all available heterogeneous accelerators becomes critical for distributed model training. However, existing frameworks, which are primarily designed for homogeneous clusters, often exhibit significant resource underutilization when deployed on heterogeneous accelerators and networks. In this paper, we present Hapt, an automated parallel training framework designed specifically for heterogeneous clusters. Hapt introduces a fine-grained planner that efficiently searches a wide space for the inter-operator parallel strategy, enabling Hapt to alleviate communication overheads while maintaining balanced loads across heterogeneous accelerators. In addition, Hapt implements a heterogeneity-aware 1F1B scheduler that adaptively adjusts the execution timing and ordering of microbatches based on network characteristics, maximizing computation-communication overlap under cross-cluster interconnects while incurring only minimal memory overhead. Our evaluation results show that Hapt can deliver 1.3x-1.6x higher performance on heterogeneous clusters than state-of-the-art training frameworks.
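To ground the planner's load-balancing objective, here is a minimal sketch of cost-model-driven partitioning in a deliberately simplified setting: whole layers (rather than Hapt's fine-grained operator partitions) are assigned to contiguous pipeline stages so as to minimize the bottleneck stage time. The function `plan_stages`, the flat per-boundary communication term, and all cost numbers are illustrative assumptions, not the paper's cost model.

```python
# Sketch: split a sequence of layers into contiguous pipeline stages over
# heterogeneous devices, minimizing the slowest stage's time. Uses a flat
# per-stage communication term as a crude stand-in for a real comm model.
from functools import lru_cache
from itertools import accumulate

def plan_stages(layer_flops, device_tflops, boundary_comm_ms):
    """Return (bottleneck_ms, splits); device i runs layers [splits[i], splits[i+1])."""
    n, s = len(layer_flops), len(device_tflops)
    prefix = [0, *accumulate(layer_flops)]

    def stage_ms(lo, hi, dev):  # time of layers [lo, hi) on device dev
        return (prefix[hi] - prefix[lo]) / device_tflops[dev] + boundary_comm_ms

    @lru_cache(maxsize=None)
    def best(lo, dev):  # min bottleneck placing layers [lo, n) on devices [dev, s)
        if dev == s - 1:
            return stage_ms(lo, n, dev), (n,)
        return min(
            (max(stage_ms(lo, cut, dev), best(cut, dev + 1)[0]),
             (cut, *best(cut, dev + 1)[1]))
            for cut in range(lo + 1, n - (s - dev - 1) + 1)
        )

    bottleneck, splits = best(0, 0)
    return bottleneck, (0, *splits)

if __name__ == "__main__":
    # Eight equal layers, one device twice as fast as the other two (assumed).
    cost, splits = plan_stages([4.0] * 8, device_tflops=[2.0, 1.0, 1.0],
                               boundary_comm_ms=1.0)
    print(cost, splits)  # 9.0 (0, 4, 6, 8): the faster device gets twice the layers
```

Even this toy version shows the key behavior the abstract describes: balanced loads emerge from giving faster accelerators proportionally larger model slices, with communication cost folded into the same objective.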
Problem

Research questions and friction points this paper is trying to address.

Optimizing distributed training on heterogeneous GPU clusters
Reducing communication overheads across varied network interconnects
Balancing computational loads among diverse accelerator architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained planner searches inter-operator parallel strategies
Heterogeneity-aware scheduler adapts microbatch execution timing
Maximizes computation-communication overlap with minimal memory overhead (see the sketch after this list)
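The overlap named in the last bullet follows the standard asynchronous-collective pattern. Below is a minimal PyTorch sketch, using a single-process gloo group purely so it runs anywhere: a collective launched with async_op=True proceeds in the background while independent compute continues, and the caller blocks only when the result is needed. It shows the generic mechanism, not Hapt's scheduler.

```python
import os
import torch
import torch.distributed as dist

def main():
    # Single-process gloo group so the demo runs without a real cluster.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    grad = torch.randn(1 << 20)                  # pretend gradient bucket
    work = dist.all_reduce(grad, async_op=True)  # communication starts here

    # Overlap window: independent compute while the all-reduce is in flight.
    x = torch.randn(512, 512)
    y = x @ x                                    # stand-in for the next micro-batch's work

    work.wait()                                  # block only when grad is needed
    print(y.shape, grad.mean().item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```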
👥 Authors
Antian Liang
Fudan University, Shanghai, China
Zhigang Zhao
Fudan University, Shanghai, China
Kai Zhang
Fudan University, Shanghai, China
Xuri Shi
Fudan University, Shanghai, China
Chuantao Li
Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan, Shandong, China
Chunxiao Wang
Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan, Shandong, China
Zhenying He
Fudan University, Shanghai, China
Yinan Jing
Fudan University, Shanghai, China
X. Sean Wang
School of Computer Science, Fudan University
Research interests: Database Systems, Information Security and Privacy, Wireless Sensor Networks, Streaming Data Processing, Time Series Queries, Dat…