🤖 AI Summary
This work addresses the inefficiencies in training multimodal large language models caused by highly heterogeneous data, which leads to load imbalance, redundant communication, and suboptimal hardware utilization under static parallelization strategies. To overcome these limitations, the authors propose Dynamic Hybrid Parallelism (DHP), a novel approach that, for the first time, supports non-power-of-two parallelism degrees. DHP incorporates a polynomial-time algorithm with millisecond-level overhead that dynamically generates near-optimal parallel configurations. By adaptively reconfiguring communication groups and adjusting parallelism degrees during training, DHP significantly improves training efficiency. Experiments on large-scale NPU clusters demonstrate up to a 1.36× throughput improvement over Megatron-LM and DeepSpeed, while maintaining near-linear scaling efficiency.
📝 Abstract
Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous, and existing training frameworks predominantly rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under such data heterogeneity. In this work, we propose Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. We generalize hybrid parallelism to non-power-of-two degrees and develop a polynomial-time algorithm that generates near-optimal parallelism strategies with only millisecond-level overhead per training batch. DHP maintains high hardware efficiency even under extreme data variability. Experimental results demonstrate that DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to a 1.36× speedup in training throughput while maintaining near-linear scaling efficiency across large-scale NPU clusters.
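To make the core idea concrete, the sketch below illustrates (under illustrative assumptions, not the paper's actual algorithm or cost model) how a per-batch, polynomial-time search could select a parallelism degree from *all* divisors of the device count, so that non-power-of-two degrees such as 3 or 6 become valid candidates when they balance compute against communication better. The function names and the toy quadratic-attention cost model are hypothetical.

```python
# Hypothetical sketch of the strategy search described in the abstract:
# for each batch, pick the context-parallel degree (out of every divisor
# of the device-pool size, including non-power-of-two values) that
# minimizes an estimated per-step cost. The cost model here is a toy
# assumption for illustration, not DHP's actual model.

def divisors(n):
    """All divisors of n -- candidate parallelism degrees,
    e.g. for n=12: 1, 2, 3, 4, 6, 12."""
    return [d for d in range(1, n + 1) if n % d == 0]

def estimate_step_time(seq_lens, cp_degree, comm_cost=2.0):
    """Toy cost model: attention work grows quadratically with sequence
    length and is sharded across cp_degree devices; communication grows
    with the degree and the total tokens exchanged."""
    compute = sum(l * l for l in seq_lens) / cp_degree
    comm = comm_cost * (cp_degree - 1) * sum(seq_lens)
    return compute + comm

def choose_parallelism(seq_lens, world_size):
    """Polynomial-time (here, linear in world_size) search over divisor
    degrees; returns the degree with the lowest estimated step time."""
    return min(divisors(world_size),
               key=lambda cp: estimate_step_time(seq_lens, cp))
```

With this toy model on 12 devices, a batch of long sequences favors the maximum degree, while a batch of short sequences can favor a non-power-of-two degree like 6, since the communication term outweighs further compute sharding; a real system would additionally reconfigure communication groups to match the chosen degree.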