Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training large language models (LLMs) on heterogeneous GPU clusters—characterized by cross-generation GPUs, multi-datacenter deployment, and significant disparities in memory capacity and computational power—faces challenges including load imbalance, memory mismatch, and inefficient inter-device communication. To address these, this paper proposes Zorse, a system designed for efficient LLM training in such environments. Its core contributions include: (1) the first integrated parallelism paradigm combining asymmetric pipeline stage partitioning with intra-heterogeneous-group data parallelism; (2) a memory-aware dynamic allocation strategy and an adaptive communication scheduling mechanism; and (3) a hardware-feature-aware training strategy planner that enables resource-utilization-driven automatic configuration optimization. Extensive experiments across diverse real-world heterogeneous setups demonstrate that Zorse achieves, on average, a 2.1× improvement in training throughput, 37% GPU memory reduction, and 42% lower communication overhead compared to state-of-the-art systems.

📝 Abstract
Large language models (LLMs) require vast amounts of GPU compute to train, but limited availability and high costs of GPUs make homogeneous clusters impractical for many organizations. Instead, assembling heterogeneous clusters by pooling together GPUs of different generations allows them to achieve higher aggregate compute and make use of all available GPUs. However, training on heterogeneous clusters presents several challenges, including load balancing across GPUs, optimizing memory usage to accommodate varying memory capacities, and ensuring communication-efficient training over diverse network interconnects potentially spanning multiple datacenters. In this paper, we make the case that efficient training on heterogeneous clusters requires (1) the integration of pipeline parallelism and data parallelism in a manner that is both communication- and memory-efficient, and (2) a more adaptable configuration of pipeline and data parallelism, which includes the capability to flexibly partition GPUs into asymmetric pipeline parallel stages and to incorporate heterogeneous GPUs within the same data parallelism group. We propose Zorse, the first system to unify all these capabilities while incorporating a planner that automatically configures training strategies for a given workload. Our evaluation shows that Zorse significantly outperforms state-of-the-art systems in heterogeneous training scenarios.
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM training on heterogeneous GPU clusters
Balancing load and memory across diverse GPUs
Enhancing communication efficiency in multi-datacenter setups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates pipeline and data parallelism efficiently
Flexibly partitions GPUs into asymmetric pipeline stages
Automates training strategy configuration
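
To make the idea of asymmetric pipeline stage partitioning concrete, here is a minimal sketch (not Zorse's actual planner, whose algorithm is not given on this page): a greedy longest-processing-time heuristic that groups heterogeneous GPUs into pipeline stages of roughly equal aggregate compute, so faster GPUs end up in smaller stages. GPU names and TFLOPS figures are illustrative assumptions.

```python
# Hypothetical sketch of asymmetric pipeline stage partitioning for a
# heterogeneous GPU pool. NOT Zorse's actual algorithm: a simple greedy
# longest-processing-time heuristic that balances per-stage aggregate compute.

def plan_stages(gpus, num_stages):
    """gpus: list of (name, relative_tflops) pairs.
    Returns num_stages dicts with balanced total compute; stage sizes
    may be asymmetric (a fast GPU can form a stage on its own)."""
    stages = [{"gpus": [], "tflops": 0.0} for _ in range(num_stages)]
    # Assign fastest GPUs first, each to the currently lightest stage.
    for name, tflops in sorted(gpus, key=lambda g: -g[1]):
        target = min(stages, key=lambda s: s["tflops"])
        target["gpus"].append(name)
        target["tflops"] += tflops
    return stages

# Illustrative mixed-generation cluster (approximate peak TFLOPS).
cluster = [("H100", 989.0), ("A100", 312.0), ("A100", 312.0),
           ("V100", 125.0), ("V100", 125.0), ("V100", 125.0)]
plan = plan_stages(cluster, num_stages=2)
# One H100 balances against five older GPUs: the two stages have nearly
# equal compute (989 vs. 999 TFLOPS) but very different GPU counts.
```

A real planner would additionally weigh memory capacity and interconnect bandwidth, as the paper's problem statement emphasizes; this sketch only shows why asymmetric stage sizes arise naturally once GPU generations differ.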
Runsheng Benson Guo
University of Waterloo, Waterloo, Canada
Utkarsh Anand
University of Waterloo, Waterloo, Canada
Khuzaima Daudjee
University of Waterloo, Waterloo, Canada
Rathijit Sen
Microsoft Gray Systems Lab
Computer Architecture · Database Systems · Machine Learning