🤖 AI Summary
To address low training efficiency and imbalanced resource utilization in large language model (LLM) training on heterogeneous GPU clusters, this paper formulates asymmetric parallel partitioning as a constrained optimization problem and proposes a hierarchical graph partitioning algorithm for heterogeneous-aware scheduling across data, pipeline, and tensor parallelism dimensions. The method eliminates reliance on homogeneous hardware, achieving near-optimal hardware utilization for 7B–30B models: HexiScale attains MFU within 3.5% of that on high-end homogeneous GPU systems on average, narrowing the gap to just 0.3% in best-case scenarios. Key contributions are: (1) a formal constrained optimization formulation for heterogeneous LLM training; (2) an efficient hierarchical graph partitioning scheme supporting asymmetric computational load distribution; and (3) the first heterogeneous training framework empirically validated on mainstream LLM scales to achieve performance approaching that of homogeneous systems.
📝 Abstract
Training large language models (LLMs) is a computationally intensive task, typically conducted in data centers equipped with homogeneous high-performance GPUs. We explore an alternative approach: deploying training computations across heterogeneous GPUs to enable greater flexibility and efficiency in heterogeneous resource utilization. To this end, we propose a novel system, HexiScale, that flexibly supports asymmetric partitioning of training computations across data, pipeline, and tensor model parallelism. We further formalize the allocation of asymmetrically partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across GPUs, fully leveraging the available computational power. We conduct empirical studies comparing HexiScale with state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B), HexiScale running over heterogeneous GPUs achieves MFU comparable to that of state-of-the-art training systems running over homogeneous high-performance GPUs with the same total peak FLOPS. The MFU gap between HexiScale and comparable homogeneous settings is as low as 0.3%, with an average of 3.5%.
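To make the asymmetric-partitioning idea concrete, here is a minimal toy sketch, not HexiScale's actual optimizer or API: it splits a model's transformer layers into pipeline stages whose depths are roughly proportional to each GPU's peak TFLOPS, so faster devices receive larger shares of the computation. The function name and the example TFLOPS figures are illustrative assumptions.

```python
# Toy illustration (NOT HexiScale's algorithm): give each heterogeneous GPU
# a pipeline stage whose layer count is proportional to its peak compute.

def asymmetric_layer_split(num_layers, gpu_tflops):
    """Return a layers-per-stage list, one stage per GPU, sized ~proportionally
    to each GPU's peak TFLOPS, using largest-remainder rounding."""
    total = sum(gpu_tflops)
    # Ideal fractional share of layers for each GPU.
    shares = [num_layers * t / total for t in gpu_tflops]
    floors = [int(s) for s in shares]
    leftover = num_layers - sum(floors)
    # Hand leftover layers to the stages with the largest fractional parts.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors

# Example: 32 layers over one 312-TFLOPS GPU and two 165-TFLOPS GPUs
# (hypothetical peak-FLOPS numbers). The fastest GPU gets the deepest stage.
split = asymmetric_layer_split(32, [312, 165, 165])
print(split)  # → [16, 8, 8]
```

A real heterogeneity-aware planner must additionally weigh memory capacity and inter-GPU bandwidth, which is what motivates the constrained-optimization formulation and hierarchical graph partitioning described in the abstract; this sketch captures only the compute-proportional intuition.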