OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multimodal large language model (MLLM) training, inconsistent modality composition across samples induces intra-microbatch modality distribution imbalance, exacerbating GPU load imbalance under data parallelism and severely limiting training efficiency and scalability. To address this bottleneck, the authors propose a synergistic framework comprising the Batch Post-Balancing Dispatcher and the MLLM Global Orchestrator, presented as the first systematic approach to model and mitigate this issue. The solution integrates dynamic batch scheduling, cross-device modality orchestration, and sequence-level post-batching rebalancing, and is compatible with mainstream training paradigms. Evaluated on 2,560 NVIDIA H100 GPUs training an 84B tri-modal MLLM, the method achieves a Model FLOPs Utilization (MFU) of 41.6%, surpassing Megatron-LM by up to 3.1× in throughput, and significantly improves large-scale MLLM training efficiency.

📝 Abstract
Multimodal large language models (MLLMs), such as GPT-4o, are garnering significant attention. During the exploration of MLLM training, we identified Modality Composition Incoherence, a phenomenon in which the proportion of a certain modality varies dramatically across different examples. It exacerbates the challenges of addressing mini-batch imbalances, which lead to uneven GPU utilization between Data Parallel (DP) instances, severely degrade the efficiency and scalability of MLLM training, and ultimately slow training and hinder further research on MLLMs. To address these challenges, we introduce OrchMLLM, a comprehensive framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. First, we propose the Batch Post-Balancing Dispatcher, a technique that efficiently eliminates mini-batch imbalances in sequential data. Additionally, we integrate the MLLM Global Orchestrator into the training framework to orchestrate multimodal data and tackle the issues arising from Modality Composition Incoherence. We evaluate OrchMLLM across various MLLM sizes, demonstrating its efficiency and scalability. Experimental results reveal that OrchMLLM achieves a Model FLOPs Utilization (MFU) of 41.6% when training an 84B MLLM with three modalities on 2,560 H100 GPUs, outperforming Megatron-LM by up to 3.1× in throughput.
Problem

Research questions and friction points this paper is trying to address.

Address modality proportion imbalance in MLLM training
Improve GPU utilization across Data Parallel instances
Enhance efficiency and scalability of multimodal training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Batch Post-Balancing Dispatcher eliminates mini-batch imbalances across DP instances
MLLM Global Orchestrator coordinates multimodal data across devices
OrchMLLM improves training efficiency and scalability at scale
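The core idea behind post-balancing can be illustrated with a small sketch. The paper's actual dispatcher is more sophisticated; the version below is a hypothetical simplification that redistributes an already-drawn global batch across data-parallel ranks so that per-rank workload (approximated here by token count) is as even as possible, using a greedy longest-first assignment. The function name `post_balance` and the token-count cost model are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch only: greedy longest-first (LPT) redistribution of a
# global batch's sequences across DP ranks, balancing total token count.
# This is NOT the paper's actual algorithm, just a minimal demonstration
# of the post-balancing idea.
import heapq

def post_balance(seq_lengths, num_dp_ranks):
    """Assign sequence indices to DP ranks, balancing summed lengths."""
    # Min-heap of (current_load, rank_id); place longest sequences first,
    # always onto the currently least-loaded rank.
    heap = [(0, rank) for rank in range(num_dp_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_dp_ranks)]
    for idx in sorted(range(len(seq_lengths)),
                      key=lambda i: -seq_lengths[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lengths[idx], rank))
    return assignment

# Example: eight sequences of varying length, two DP ranks.
lengths = [4096, 512, 2048, 1024, 3072, 256, 1536, 768]
buckets = post_balance(lengths, 2)
loads = [sum(lengths[i] for i in b) for b in buckets]
print(loads)  # → [6656, 6656]
```

In contrast, a naive contiguous split of the same batch would give one rank far more tokens than the other; the same load-balancing principle, applied per modality, is what lets the dispatcher even out GPU utilization between DP instances.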