OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multimodal large language model (MLLM) training, inconsistent modality composition across samples induces intra-microbatch modality distribution imbalance, exacerbating GPU load imbalance under data parallelism and severely limiting training efficiency and scalability. To address this bottleneck, the authors propose a synergistic framework comprising the Batch Post-Balancing Dispatcher and the MLLM Global Orchestrator, presented as the first systematic approach to model and mitigate this issue. The solution integrates dynamic batch scheduling, cross-device modality orchestration, and sequence-level post-batching rebalancing, and is compatible with mainstream training paradigms. Evaluated on 2,560 NVIDIA H100 GPUs training an 84B tri-modal MLLM, the method achieves a Model FLOPs Utilization (MFU) of 41.6%, surpassing Megatron-LM by up to 3.1× in throughput, and significantly improves large-scale MLLM training efficiency.

📝 Abstract
Multimodal large language models (MLLMs), such as GPT-4o, are garnering significant attention. During the exploration of MLLM training, we identified Modality Composition Incoherence, a phenomenon in which the proportion of a certain modality varies dramatically across different examples. It exacerbates the challenges of addressing mini-batch imbalances, which lead to uneven GPU utilization between Data Parallel (DP) instances, severely degrade the efficiency and scalability of MLLM training, and ultimately slow training and hinder further research on MLLMs. To address these challenges, we introduce OrchMLLM, a comprehensive framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. First, we propose the Batch Post-Balancing Dispatcher, a technique that efficiently eliminates mini-batch imbalances in sequential data. Additionally, we integrate the MLLM Global Orchestrator into the training framework to orchestrate multimodal data and tackle the issues arising from Modality Composition Incoherence. We evaluate OrchMLLM across various MLLM sizes, demonstrating its efficiency and scalability. Experimental results reveal that OrchMLLM achieves a Model FLOPs Utilization (MFU) of 41.6% when training an 84B MLLM with three modalities on 2,560 H100 GPUs, outperforming Megatron-LM by up to 3.1× in throughput.
Problem

Research questions and friction points this paper is trying to address.

Address modality proportion imbalance in MLLM training
Improve GPU utilization across Data Parallel instances
Enhance efficiency and scalability of multimodal training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Batch Post-Balancing Dispatcher eliminates mini-batch imbalances across DP instances
MLLM Global Orchestrator coordinates multimodal data across devices
OrchMLLM improves training efficiency and scalability at scale
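The core idea behind post-balancing can be illustrated with a small sketch. The paper's actual dispatcher is more sophisticated; the version below is a hypothetical simplification that redistributes an already-drawn global batch across data-parallel ranks so that per-rank workload (approximated here by token count) is as even as possible, using a greedy longest-first assignment. The function name `post_balance` and the token-count cost model are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch only: greedy longest-first (LPT) redistribution of a
# global batch's sequences across DP ranks, balancing total token count.
# This is NOT the paper's actual algorithm, just a minimal demonstration
# of the post-balancing idea.
import heapq

def post_balance(seq_lengths, num_dp_ranks):
    """Assign sequence indices to DP ranks, balancing summed lengths."""
    # Min-heap of (current_load, rank_id); place longest sequences first,
    # always onto the currently least-loaded rank.
    heap = [(0, rank) for rank in range(num_dp_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_dp_ranks)]
    for idx in sorted(range(len(seq_lengths)),
                      key=lambda i: -seq_lengths[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lengths[idx], rank))
    return assignment

# Example: eight sequences of varying length, two DP ranks.
lengths = [4096, 512, 2048, 1024, 3072, 256, 1536, 768]
buckets = post_balance(lengths, 2)
loads = [sum(lengths[i] for i in b) for b in buckets]
print(loads)  # → [6656, 6656]
```

In contrast, a naive contiguous split of the same batch would give one rank far more tokens than the other; the same load-balancing principle, applied per modality, is what lets the dispatcher even out GPU utilization between DP instances.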