Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain

📅 2026-05-26

🤖 AI Summary

This work addresses load imbalance across modalities and across batches in multimodal large language model (MLLM) training, which arises from data heterogeneity and sample-level coupling between modalities. The authors propose Entrain, a distributed training framework that shifts workload profiling from individual samples to whole batches. At that granularity, they prove that a single static model-parallel configuration suffices for optimal load balancing, challenging the assumption that variable data requires dynamic parallelism. Within each iteration, Entrain applies a hierarchical microbatch assignment algorithm that defers excess workload to even out load across microbatches while the parallel configuration stays fixed. In the reported evaluations, Entrain reduces workload variability across microbatches by up to 10.6× and improves end-to-end training throughput by up to 1.40× over existing baselines.

📝 Abstract

Multimodal LLM datasets are inherently heterogeneous, with significant data variability. Although each modality exhibits independent variability, sample-level entanglement makes it difficult to balance workloads across both modalities and batches. We present Entrain, a distributed MLLM training framework that addresses both heterogeneity and variability in multimodal training workloads. Entrain challenges the intuition that dynamic data variability requires dynamic model parallelism by shifting the profiling paradigm from micro-level samples to macroscopic batches. We prove that a single, static model-parallel configuration suffices for optimal load balancing under this paradigm. At the microscopic scale, Entrain introduces a hierarchical microbatch assignment algorithm that defers excess workload within each iteration to stabilize variability across microbatches. Evaluations show that Entrain reduces workload variability across microbatches by up to 10.6×, improving end-to-end training throughput by up to 1.40× over existing baselines.
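
To make the notion of workload variability across microbatches concrete, the sketch below balances per-sample costs across a fixed number of microbatches with the generic longest-processing-time (LPT) greedy heuristic. This is not Entrain's hierarchical assignment algorithm, which the abstract does not specify; the function names `greedy_assign` and `load_ratio`, the cost values, and the use of a max/min load ratio as the imbalance metric are all assumptions made for illustration.

```python
import heapq
from typing import List, Sequence


def greedy_assign(costs: Sequence[float], num_microbatches: int) -> List[List[int]]:
    """Assign sample indices to microbatches with the LPT greedy heuristic.

    Illustrative only: a generic balancing heuristic, not Entrain's method.
    `costs` holds hypothetical per-sample workload estimates.
    """
    if num_microbatches <= 0:
        raise ValueError("num_microbatches must be positive")
    # Min-heap of (current load, microbatch id): the lightest microbatch is on top.
    heap = [(0.0, b) for b in range(num_microbatches)]
    heapq.heapify(heap)
    bins: List[List[int]] = [[] for _ in range(num_microbatches)]
    # Place the most expensive samples first, each into the lightest microbatch.
    for idx in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, b = heapq.heappop(heap)
        bins[b].append(idx)
        heapq.heappush(heap, (load + costs[idx], b))
    return bins


def load_ratio(costs: Sequence[float], bins: List[List[int]]) -> float:
    """Return the heaviest-to-lightest microbatch load ratio (1.0 is perfectly even)."""
    loads = [sum(costs[i] for i in b) for b in bins]
    lightest = min(loads)
    return float("inf") if lightest == 0 else max(loads) / lightest


# Hypothetical per-sample costs, e.g. token counts summed over modalities.
costs = [8, 7, 6, 5, 4, 3, 2, 1]
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]  # contiguous split in arrival order
balanced = greedy_assign(costs, num_microbatches=2)
print(load_ratio(costs, naive), load_ratio(costs, balanced))
```

With these made-up costs, the contiguous split yields loads of 26 and 10, while the greedy assignment yields 18 and 18. A heuristic like this only rearranges samples within an iteration; it says nothing about the batch-level profiling or the static-configuration result that the paper claims.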

Problem

Research questions and friction points this paper is trying to address.

Variable Heterogeneity

Distributed Multimodal Training

Workload Balancing

Multimodal LLM

Data Variability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed Multimodal Training

Model Parallelism

Workload Balancing