🤖 AI Summary
In 4D parallel training (data, tensor, pipeline, and context parallelism) of large language models, pipeline and context-level load imbalance severely degrades GPU utilization and training efficiency.
Method: This paper proposes WLB-LLM, a co-optimization framework that jointly addresses these imbalances. It introduces (i) a workload-aware variable-length document packing method that balances computation and communication workload across micro-batches at the pipeline-parallelism level, and (ii) a novel fine-grained per-document sharding strategy that gives every worker within a context-parallelism group an identical workload.
Contribution/Results: The approach achieves an average 1.23× end-to-end speedup across multiple model scales, significantly mitigates load skew, and substantially improves GPU cluster utilization, making it the first solution to achieve strict load balance across both the pipeline and context dimensions in 4D-parallel LLM training.
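The pipeline-level packing idea can be illustrated with a simple greedy heuristic (a minimal sketch under assumed details, not WLB-LLM's actual algorithm: the cost model, function name, and longest-first order are all assumptions here). Variable-length documents are assigned to whichever micro-batch currently has the lowest estimated cost, where per-document attention cost is approximated as quadratic in document length.

```python
# Hypothetical sketch of workload-aware variable-length document packing:
# greedily place documents into micro-batches so the estimated attention
# workload (~ length^2 per document under document-level causal masking)
# stays balanced. Illustration only, not the paper's actual method.

def pack_documents(doc_lengths, num_micro_batches):
    """Greedy longest-first packing that minimizes the max per-batch cost."""
    batches = [[] for _ in range(num_micro_batches)]
    costs = [0] * num_micro_batches
    # Place long documents first so short ones can fill remaining gaps.
    for length in sorted(doc_lengths, reverse=True):
        i = costs.index(min(costs))      # least-loaded micro-batch so far
        batches[i].append(length)
        costs[i] += length * length      # attention FLOPs grow ~ O(len^2)
    return batches, costs

# Example: eight documents packed into two micro-batches.
batches, costs = pack_documents([8, 7, 6, 5, 4, 3, 2, 1], 2)
print(costs)   # the two micro-batches end up with equal estimated cost
```

A length-proportional cost model would balance token counts instead; the quadratic term reflects that attention cost, not just token count, drives the imbalance between micro-batches.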
📝 Abstract
In this work, we present WLB-LLM, a workLoad-balanced 4D parallelism framework for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance at the pipeline parallelism and context parallelism levels. Then, to address the imbalance issue, at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method to balance the computation and communication workload across micro-batches. Additionally, at the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy, ensuring each worker within a context parallelism group has an identical workload. Comprehensive experiments under different model scales demonstrate that WLB-LLM significantly mitigates the workload imbalance during 4D parallelism LLM training and achieves an average speedup of 1.23x when applied in our internal LLM training framework.
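The per-document sharding idea can be sketched as follows (a hypothetical illustration under assumed details; the function name and remainder-handling policy are assumptions, and the paper's actual strategy is more fine-grained): instead of splitting the whole packed sequence into contiguous chunks, each document is split individually across the context-parallel group, so every worker receives a near-equal slice of every document.

```python
# Hypothetical sketch of fine-grained per-document context sharding:
# each document in a packed sequence is divided across all context-parallel
# workers, so no worker ends up holding only the long documents.
# Illustration only, not WLB-LLM's exact sharding scheme.

def shard_per_document(doc_lengths, cp_size):
    """Return per-worker token counts when every document is split evenly."""
    shards = [[] for _ in range(cp_size)]
    for length in doc_lengths:
        base, rem = divmod(length, cp_size)
        for rank in range(cp_size):
            # Remainder tokens go to the lowest ranks, one extra token each.
            shards[rank].append(base + (1 if rank < rem else 0))
    return shards

# Example: two documents (10 and 6 tokens) across a CP group of 4 workers.
shards = shard_per_document([10, 6], 4)
print(shards)  # every worker holds a slice of both documents
```

Contiguous sequence-level sharding could instead give one worker an entire long document while another holds only short ones; slicing per document bounds that skew to at most one token per document per worker.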