🤖 AI Summary
Existing distributed LLM training frameworks struggle to accommodate the heterogeneous modal inputs and hybrid model architectures of multimodal large language models (MLLMs), resulting in suboptimal training efficiency. To address this, we propose the first general-purpose distributed MLLM training framework, introducing a novel multimodal-aware collaborative parallelism paradigm. Our framework supports modular architecture construction, composable parallelization of constituent submodels, and MLLM-specific optimizations to pipeline and context parallelism. By decoupling multimodal architectures and employing fine-grained scheduling, it jointly optimizes computation, communication, and memory utilization. Experiments on mainstream MLLMs demonstrate up to a 1.57× improvement in training throughput, significantly accelerating large-scale multimodal model training. This work provides a systematic, scalable solution for efficient distributed MLLM training.
📝 Abstract
Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities like images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions to existing LLM training frameworks unsuitable for efficient MLLM training. In this paper, we present Cornstarch, the first general-purpose distributed MLLM training framework. Cornstarch facilitates modular MLLM construction, enables composable parallelization of constituent models, and introduces MLLM-specific optimizations to pipeline and context parallelism for efficient distributed MLLM training. Our evaluation shows that Cornstarch outperforms state-of-the-art solutions by up to $1.57\times$ in terms of training throughput.