GPU Memory Prediction for Multimodal Model Training

📅 2025-11-26
🤖 AI Summary
Frequent GPU out-of-memory (OOM) errors during multimodal large model training, coupled with poor generalizability of existing unimodal memory prediction methods, hinder efficient resource utilization and cause training interruptions. Method: This paper proposes the first layer-granularity GPU memory peak prediction framework tailored for multimodal models. It jointly models memory consumption across heterogeneous modalities—such as vision and language components—by parsing multimodal architectural heterogeneity, capturing inter-layer memory coupling, and integrating a training-trajectory-aware factorized estimation mechanism. Contribution/Results: Experimental evaluation demonstrates that the method achieves a mean absolute percentage error (MAPE) of only 8.7% across diverse multimodal tasks, significantly outperforming unimodal baselines. It effectively prevents OOM-induced training failures and enhances GPU resource utilization, enabling robust and scalable multimodal model training.

📝 Abstract
As deep learning models in agentic AI systems grow in scale and complexity, their GPU memory requirements increase and often exceed the available GPU memory capacity, causing out-of-memory (OoM) errors. An OoM error interrupts the entire training run and wastes substantial computational resources, so accurate prediction of GPU memory usage is essential to prevent it. However, previous studies focus only on unimodal architectures and fail to generalize to multimodal models, even though multimodal models are a common choice in agentic AI systems. To address this limitation, we propose a framework that predicts the peak GPU memory usage of multimodal models by analyzing their architecture and training behavior. Specifically, the framework decomposes the multimodal model into its constituent layers and applies factorization to estimate the memory usage of each layer. Our evaluation shows that the framework achieves high prediction accuracy, with an average MAPE of approximately 8.7%.
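The reported accuracy metric, mean absolute percentage error (MAPE), is standard; as a quick illustration, it can be computed over measured vs. predicted peak memory like this (the run values below are hypothetical, not from the paper):

```python
def mape(actual, predicted):
    """Mean absolute percentage error between measured and predicted values."""
    assert len(actual) == len(predicted) and len(actual) > 0
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical measured vs. predicted peak GPU memory (GiB) for three runs
measured = [40.0, 24.0, 64.0]
predicted = [42.0, 23.0, 60.0]
print(f"MAPE = {mape(measured, predicted):.1f}%")
```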
Problem

Research questions and friction points this paper is trying to address.

Predict GPU memory usage for multimodal models
Prevent out-of-memory errors during training
Generalize memory prediction beyond unimodal architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts GPU memory for multimodal models
Decomposes model layers for memory estimation
Uses factorization to estimate per-layer usage
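The paper's exact factorization is not spelled out in this listing, but the layer-wise idea can be sketched as follows: decompose the model into layers, factor each layer's footprint into persistent state (weights, gradients, optimizer states) and per-sample activations, then sum. All names, layer sizes, and the optimizer-state multiplier below are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """Hypothetical per-layer description produced by an architecture parser."""
    name: str
    n_params: int              # parameter count
    act_elems_per_sample: int  # activation elements kept per sample for backward

def estimate_peak_bytes(layers, batch_size, bytes_per_elem=4, optimizer_states=2):
    """Sketch of layer-wise factorized estimation: peak ~= persistent state
    (weights + gradients + optimizer states) plus live activations."""
    # weights + gradients contribute 2x params; Adam-style optimizers add ~2 more states
    persistent = sum(l.n_params for l in layers) * bytes_per_elem * (2 + optimizer_states)
    activations = sum(l.act_elems_per_sample for l in layers) * batch_size * bytes_per_elem
    return persistent + activations

# Hypothetical two-component multimodal stub: a vision block and a language block
model = [
    Layer("vision_block", n_params=10_000_000, act_elems_per_sample=500_000),
    Layer("language_block", n_params=50_000_000, act_elems_per_sample=1_000_000),
]
print(estimate_peak_bytes(model, batch_size=8) / 2**20, "MiB estimated peak")
```

A real predictor would additionally model inter-layer memory coupling and training-trajectory effects, which this flat sum ignores.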
Jinwoo Jeong, Ph.D. student at Korea University
Minchul Kang, Korea University, Seoul, Republic of Korea
Younghun Go, Korea University, Seoul, Republic of Korea
Changyong Shin, Korea University, Seoul, Republic of Korea
Hyunho Lee, Korea University, Seoul, Republic of Korea
Junho Yoon, KT Corporation, Seoul, Republic of Korea
Gyeongsik Yang, Korea University (Operating systems, Network virtualization, Datacenter networking, Distributed deep learning)
Chuck Yoo, Korea University, Seoul, Republic of Korea