Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation

📅 2024-08-07
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Current MLLM training suffers from substantial GPU compute bubbles caused by data dependencies between heterogeneous modality models (e.g., ViT and GPT) under 3D parallelism, degrading end-to-end training throughput. To address this, the paper proposes scheduling encoder computation *within* the LLM's compute bubbles. The approach searches for separate parallelism plans for the encoder and the LLM, and applies a bubble-aware scheduling algorithm that exploits LLM bubbles without violating the MLLM's data dependencies. It further decomposes encoder layers into fine-grained kernels and models common 3D-parallelism bubble patterns to optimize sub-millisecond bubble scheduling. Evaluated on a production cluster of 3,072 GPUs (ViT-22B + GPT-175B), the method improves training throughput by 20.5%-21.3% over baselines, offering an efficient and scalable path for large-scale multimodal foundation model training.

📝 Abstract
Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text, and audio, achieving strong performance in domains including multimodal translation, visual question answering, and content generation. Nonetheless, existing systems are inefficient at training MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible for all GPUs, Optimus searches for separate parallel plans for the encoder and the LLM, and adopts a bubble scheduling algorithm that exploits LLM bubbles without breaking the original data dependencies in the MLLM model architecture. We further decompose encoder layer computation into a series of kernels, and analyze the common bubble patterns of 3D parallelism to carefully optimize the sub-millisecond bubble scheduling, minimizing the overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with a ViT-22B and GPT-175B model over 3,072 GPUs compared to baselines.
Problem

Research questions and friction points this paper is trying to address.

Reducing GPU bubbles in MLLM training
Optimizing encoder computation scheduling
Accelerating multi-modal LLM training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schedules encoder computation within LLM bubbles
Searches separate parallel plans for encoder and LLM
Optimizes sub-millisecond bubble scheduling
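The core idea above can be illustrated with a toy greedy packer: given idle windows (bubbles) in an LLM's pipeline schedule and a sequence of encoder kernels that must run in order (to preserve data dependencies), place each kernel into the earliest bubble with room. This is a minimal sketch under assumed inputs, not Optimus's actual scheduling algorithm; all durations and window times are hypothetical.

```python
def schedule_kernels(bubbles, kernels):
    """Greedily pack ordered encoder kernels into LLM bubbles.

    bubbles: list of (start, end) idle windows, sorted by start time
             (hypothetical units, e.g. microseconds).
    kernels: list of kernel durations; kernels must execute in order.
    Returns a list with one entry per kernel: (kernel_idx, bubble_idx,
    start_time), or None if the kernel fits in no bubble.
    """
    schedule = []
    cursor = [start for start, _ in bubbles]  # next free time per bubble
    earliest = 0  # a kernel may not start before its predecessor finishes
    for i, dur in enumerate(kernels):
        placed = None
        for j, (start, end) in enumerate(bubbles):
            t = max(cursor[j], earliest)  # respect both bubble occupancy
            if t + dur <= end:            # and kernel ordering
                placed = (i, j, t)
                cursor[j] = t + dur
                earliest = t + dur
                break
        schedule.append(placed)
    return schedule


# Example: two bubbles, four kernels; the third kernel spills into the
# second bubble because the first is nearly full.
sched = schedule_kernels(bubbles=[(0, 100), (150, 300)],
                         kernels=[40, 40, 40, 100])
print(sched)  # → [(0, 0, 0), (1, 0, 40), (2, 1, 150), (3, 1, 190)]
```

The real system must additionally account for kernel launch overheads, cross-stage communication, and the repeating bubble patterns of 3D parallelism, which is why Optimus models bubble patterns rather than treating windows as arbitrary intervals.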