From Image to Video, what do we need in multimodal LLMs?

📅 2024-04-18
🏛️ arXiv.org
📈 Citations: 8 · Influential: 0
🤖 AI Summary
Training video large language models (Video LLMs) incurs prohibitively high computational costs and relies heavily on massive video datasets. Method: This paper proposes RED-VILLM, a resource-efficient development paradigm that leverages off-the-shelf image large language models (Image LLMs) and introduces a lightweight temporal adaptation module for efficient knowledge transfer, eliminating redundant architectural design and costly large-scale video pretraining. It achieves, for the first time, efficient cross-modal transfer from Image LLMs to Video LLMs via a plug-and-play temporal modeling structure, combined with instruction tuning and multi-stage alignment training. Contribution/Results: RED-VILLM outperforms conventional Video LLMs under extremely limited instruction data and computational resources, significantly reducing both computational overhead and dependence on video data. Additionally, this work releases the first open-source Video LLM tailored to the Chinese research community.

📝 Abstract
From Image LLMs to the more complex Video LLMs, multimodal large language models (MLLMs) have demonstrated profound capabilities in comprehending cross-modal information, as numerous studies have illustrated. Previous methods design comprehensive Video LLMs by integrating video foundation models with base LLMs. Despite its effectiveness, this paradigm renders the Video LLM's structure verbose and typically requires substantial video data for pre-training. Crucially, it neglects the foundational contributions of ready-made Image LLMs. In this paper, we introduce RED-VILLM, a Resource-Efficient Development pipeline that builds robust Video LLMs by leveraging the prior knowledge of Image LLMs. Specifically, since a video is naturally a sequence of images along the temporal dimension, we devise a plug-and-play temporal adaptation structure that endows the backbone Image LLM with the capability to grasp temporal information. Moreover, through this pipeline, we deliver the first Video LLM within the Chinese-speaking community. Extensive experiments demonstrate that Video LLMs developed with our approach surpass conventional Video LLMs while requiring minimal instruction data and training resources. Our approach highlights the potential for more cost-effective and scalable advancement of multimodal models.
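The abstract's key mechanism is a plug-and-play temporal adaptation structure that lets an existing Image LLM ingest per-frame features as video. The page does not specify the module's internals, so the following is a minimal, hypothetical PyTorch sketch of one plausible design: lightweight temporal self-attention over frames followed by mean pooling, producing a token set shaped like a single image's. All names here (TemporalAdapter, frame_features) and the attention-plus-pooling design are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a temporal adaptation module (not from the paper).
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Fuses per-frame visual tokens along the temporal axis so an
    Image LLM can consume video input. One plausible plug-and-play design."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Lightweight temporal self-attention over frames, per spatial token.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, frames, tokens, dim) from the image encoder.
        b, t, n, d = frame_features.shape
        # Attend across time independently at each spatial token position.
        x = frame_features.permute(0, 2, 1, 3).reshape(b * n, t, d)
        attn_out, _ = self.temporal_attn(x, x, x)
        x = self.norm(x + attn_out)
        # Mean-pool over frames -> one token set, the shape an Image LLM expects.
        return x.mean(dim=1).reshape(b, n, d)

# Usage: encode each frame with the Image LLM's vision tower, adapt, then
# prepend the pooled visual tokens to the text embeddings as usual.
adapter = TemporalAdapter(dim=1024)
frames = torch.randn(2, 8, 256, 1024)   # 2 clips, 8 frames, 256 tokens each
video_tokens = adapter(frames)          # -> (2, 256, 1024)
```

Mean pooling is the simplest temporal aggregation; the actual module could equally use learned queries or a small temporal transformer, so treat this only as a sketch of the general idea.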
Problem

Research questions and friction points this paper is trying to address.

Efficiently develop Video LLMs using Image LLMs
Leverage Image LLMs' knowledge for temporal video understanding
Minimize training data and resources for Video LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Image LLMs for video understanding
Introduces temporal adaptation plug-and-play structure
Requires minimal data and training resources
👥 Authors
Suyuan Huang (Beihang University, Beijing, China)
Haoxin Zhang (Xiaohongshu, Beijing, China)
Yan Gao (Xiaohongshu, Beijing, China)
Yao Hu (Zhejiang University) · Machine Learning
Zengchang Qin (Beihang University) · Machine Learning, Multimedia Retrieval, Collective Intelligence, Uncertainty Modeling for Data