🤖 AI Summary
In 4D parallel training (data, tensor, pipeline, and context parallelism) of large language models, pipeline and context-level load imbalance severely degrades GPU utilization and training efficiency.
Method: This paper proposes WLB-LLM, a co-optimization framework that jointly addresses these imbalances. It introduces (i) a workload-aware variable-length document packing method that balances computation and communication workload across micro-batches at the pipeline-parallelism level, and (ii) a novel fine-grained per-document sharding strategy that gives every worker within a context-parallelism group an identical workload.
Contribution/Results: The approach achieves an average 1.23× end-to-end speedup across multiple model scales, significantly mitigates load skew, and substantially improves GPU cluster utilization, making it the first solution to achieve strict load balance across both the pipeline and context dimensions in 4D-parallel LLM training.
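The pipeline-level packing idea can be illustrated with a simple greedy heuristic (a minimal sketch under assumed details, not WLB-LLM's actual algorithm: the cost model, function name, and longest-first order are all assumptions here). Variable-length documents are assigned to whichever micro-batch currently has the lowest estimated cost, where per-document attention cost is approximated as quadratic in document length.

```python
# Hypothetical sketch of workload-aware variable-length document packing:
# greedily place documents into micro-batches so the estimated attention
# workload (~ length^2 per document under document-level causal masking)
# stays balanced. Illustration only, not the paper's actual method.

def pack_documents(doc_lengths, num_micro_batches):
    """Greedy longest-first packing that minimizes the max per-batch cost."""
    batches = [[] for _ in range(num_micro_batches)]
    costs = [0] * num_micro_batches
    # Place long documents first so short ones can fill remaining gaps.
    for length in sorted(doc_lengths, reverse=True):
        i = costs.index(min(costs))      # least-loaded micro-batch so far
        batches[i].append(length)
        costs[i] += length * length      # attention FLOPs grow ~ O(len^2)
    return batches, costs

# Example: eight documents packed into two micro-batches.
batches, costs = pack_documents([8, 7, 6, 5, 4, 3, 2, 1], 2)
print(costs)   # the two micro-batches end up with equal estimated cost
```

A length-proportional cost model would balance token counts instead; the quadratic term reflects that attention cost, not just token count, drives the imbalance between micro-batches.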
📝 Abstract
In this work, we present WLB-LLM, a workLoad-balanced 4D parallelism framework for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance at the pipeline parallelism and context parallelism levels. Then, to address the imbalance issue, at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method to balance the computation and communication workload across micro-batches. Additionally, at the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy, ensuring each worker within a context parallelism group has an identical workload. Comprehensive experiments under different model scales demonstrate that WLB-LLM significantly mitigates the workload imbalance during 4D parallelism LLM training and achieves an average speedup of 1.23x when applied in our internal LLM training framework.
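The per-document sharding idea can be sketched as follows (a hypothetical illustration under assumed details; the function name and remainder-handling policy are assumptions, and the paper's actual strategy is more fine-grained): instead of splitting the whole packed sequence into contiguous chunks, each document is split individually across the context-parallel group, so every worker receives a near-equal slice of every document.

```python
# Hypothetical sketch of fine-grained per-document context sharding:
# each document in a packed sequence is divided across all context-parallel
# workers, so no worker ends up holding only the long documents.
# Illustration only, not WLB-LLM's exact sharding scheme.

def shard_per_document(doc_lengths, cp_size):
    """Return per-worker token counts when every document is split evenly."""
    shards = [[] for _ in range(cp_size)]
    for length in doc_lengths:
        base, rem = divmod(length, cp_size)
        for rank in range(cp_size):
            # Remainder tokens go to the lowest ranks, one extra token each.
            shards[rank].append(base + (1 if rank < rem else 0))
    return shards

# Example: two documents (10 and 6 tokens) across a CP group of 4 workers.
shards = shard_per_document([10, 6], 4)
print(shards)  # every worker holds a slice of both documents
```

Contiguous sequence-level sharding could instead give one worker an entire long document while another holds only short ones; slicing per document bounds that skew to at most one token per document per worker.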