BigMac: Breaking the Pareto Frontier of Compute and Memory in Multimodal LLM Training

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Training multimodal large language models involves a Pareto trade-off between compute efficiency and memory consumption. This work proposes BigMac, a training pipeline that nests encoder and generator computation inside the main LLM pipeline in a dependency-safe way. The design reduces the activation memory complexity of the encoder and generator to O(1) without changing the activation memory complexity of the LLM, and it matches the compute efficiency of an idealized setting with unlimited memory. BigMac thereby breaks the Pareto frontier between compute and memory in this setting. In experiments on multiple MLLMs and training workloads, BigMac achieves a 1.08× to 1.9× training speedup over baseline systems while memory usage stays stable as batch size increases.

📝 Abstract

Training multimodal large language models (MLLMs) is challenged by both model and data heterogeneity. Existing systems redesign the training pipeline to address these challenges, but remain bound by a Pareto frontier between compute and memory efficiency, improving one only at the expense of the other. We present BigMac, a new training pipeline for multimodal LLMs. The core idea of BigMac is to nest the encoder and generator computation into the original LLM pipeline, forming a dependency-safe nested pipeline structure. With this design, BigMac reduces the activation memory complexity of the encoder and generator to O(1) while keeping the activation memory complexity of the LLM unchanged. At the same time, it achieves the same computational efficiency as the idealized setting with unlimited memory. As a result, BigMac breaks the Pareto frontier between computational efficiency and memory usage, enabling simultaneous optimization of both in MLLM training. We evaluate BigMac on multiple MLLMs and training workloads. Experimental results show that BigMac achieves a 1.08× to 1.9× training speedup over baseline systems while maintaining stable memory usage as batch size increases.
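To make the O(1) activation-memory claim concrete, the following is a toy accounting sketch in Python. It is not BigMac's actual schedule, which the abstract does not specify. The schedule names (`eager_schedule`, `nested_schedule`) and the alloc/free event model are assumptions introduced purely for illustration. The sketch compares how many micro-batches' worth of encoder activations are held at once when all encoder forwards run up front versus when each micro-batch's encoder activations are released before the next are created.

```python
def peak_live(schedule):
    """Return the peak number of activation sets held at the same time."""
    live, peak = set(), 0
    for op, micro_batch in schedule:
        if op == "alloc":
            live.add(micro_batch)
        else:  # "free"
            live.remove(micro_batch)
        peak = max(peak, len(live))
    return peak


def eager_schedule(m):
    # Hypothetical baseline: run every encoder forward first and
    # release each activation set only afterwards.
    return [("alloc", i) for i in range(m)] + [("free", i) for i in range(m)]


def nested_schedule(m):
    # Hypothetical nested ordering: each micro-batch's encoder
    # activations are created and released before the next begins.
    return [step for i in range(m) for step in (("alloc", i), ("free", i))]


if __name__ == "__main__":
    for m in (4, 16, 64):
        print(m, peak_live(eager_schedule(m)), peak_live(nested_schedule(m)))
```

Under these assumptions the eager ordering's peak grows linearly with the number of micro-batches, while the nested ordering's peak stays constant, which is the kind of behavior the abstract describes as memory usage remaining stable as batch size increases.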

Problem

Research questions and friction points this paper is trying to address.

multimodal LLM training

compute-memory trade-off

Pareto frontier

activation memory

training efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

nested pipeline

activation memory complexity

Pareto frontier