InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

📅 2025-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of simultaneously achieving long-term temporal modeling and fine-grained visual understanding in video multimodal large language models (MLLMs), this paper proposes the Long-and-Rich Context (LRC) paradigm. Methodologically, LRC incorporates dense vision-task annotations (e.g., object tracking, segmentation) into the MLLM via direct preference optimization (DPO) and builds compact spatiotemporal representations through adaptive hierarchical token compression. This design preserves high temporal resolution while substantially reducing representational redundancy, thereby strengthening long-horizon memory retention and attention focus. Experiments show state-of-the-art performance on both short- and long-video understanding benchmarks, the ability to memorize video inputs at least six times longer than the original model could handle, and mastery of specialized vision tasks such as object tracking and segmentation within a single MLLM framework.
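The token-compression step named in the summary can be sketched in code. The routine below is an illustrative assumption, not the paper's actual implementation: the function name `compress_video_tokens`, the redundancy heuristic, the keep ratio, and the temporal window are all hypothetical, chosen only to show how per-frame token pruning followed by temporal pooling could shrink a long video's token sequence before it reaches the LLM.

```python
# Illustrative sketch (not the paper's code): hierarchical spatiotemporal token
# compression for video tokens of shape [T, N, D] (frames, tokens per frame, dim).
import torch
import torch.nn.functional as F

def compress_video_tokens(tokens: torch.Tensor,
                          spatial_keep: float = 0.5,
                          temporal_window: int = 4) -> torch.Tensor:
    """Two-stage compression: drop redundant tokens within each frame, then
    average-pool the survivors over short temporal windows. All ratios here
    are assumptions, not values reported by the paper."""
    T, N, D = tokens.shape

    # Stage 1 (spatial): keep the tokens least similar to their frame mean
    # (a crude redundancy heuristic) and drop the rest.
    frame_mean = tokens.mean(dim=1, keepdim=True)                  # [T, 1, D]
    sim = F.cosine_similarity(tokens, frame_mean, dim=-1)          # [T, N]
    k = max(1, int(N * spatial_keep))
    keep_idx = sim.topk(k, dim=1, largest=False).indices           # least redundant
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    # Stage 2 (temporal): average-pool kept tokens over non-overlapping windows
    # of `temporal_window` frames to shorten the sequence for long videos.
    pad = (-T) % temporal_window
    if pad:
        kept = torch.cat([kept, kept[-1:].expand(pad, -1, -1)], dim=0)
    kept = kept.view(-1, temporal_window, k, D).mean(dim=1)        # [T', k, D]
    return kept.reshape(-1, D)                                     # flat token sequence

# Example: 64 frames, 256 tokens per frame, 1024-dim features.
video_tokens = torch.randn(64, 256, 1024)
compressed = compress_video_tokens(video_tokens)
print(compressed.shape)  # torch.Size([2048, 1024]) under these assumed ratios
```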

📝 Abstract
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original) and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLMs' innate abilities (focus and memory), providing new insights for future research on video MLLMs. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
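As a rough illustration of the preference-optimization step mentioned in the abstract, the standard DPO objective is sketched below. This is a minimal sketch of the generic DPO loss, assuming preference pairs in which the "chosen" response agrees with dense vision-task annotations and the "rejected" one does not; it is not InternVideo2.5's published training code, and the variable names and beta value are assumptions.

```python
# Minimal sketch of the standard DPO objective (not InternVideo2.5's actual recipe).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Push the policy to prefer the 'chosen' response (e.g., one consistent with
    dense vision-task annotations) over the 'rejected' one, measured relative to
    a frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage: summed log-probabilities for a batch of 8 preference pairs.
batch = 8
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(f"DPO loss: {loss.item():.4f}")
```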
Problem

Research questions and friction points this paper is trying to address.

Video Multimodal Large Language Model
Long Video Information Processing
Fine-Grained Visual Detail Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

InternVideo2.5
Video Multimodal Large Language Model (MLLM)
Enhanced Temporal and Spatial Processing
👥 Authors
Yi Wang (Shanghai AI Laboratory)
Xinhao Li (Nanjing University) · Video Understanding, Multimodal LLM, Vision-Language Learning
Ziang Yan (Shanghai AI Laboratory)
Yinan He (Shanghai AI Laboratory)
Jiashuo Yu (Shanghai AI Laboratory) · Audio-Visual Learning, Computer Vision, Multimodal Learning
Xiangyu Zeng (Nanjing University, Shanghai AI Laboratory)
Chenting Wang (Shanghai Jiao Tong University) · Computer Vision, Video Understanding
Changlian Ma (Nanjing University, Shanghai AI Laboratory)
Haian Huang (Shanghai AI Laboratory)
Jianfei Gao (Shanghai AI Laboratory)
Min Dou (Shanghai AI Laboratory) · Autonomous Driving, MLLM, Embodied AI
Kaiming Chen (Shanghai AI Laboratory)
Wenhai Wang (Shanghai AI Laboratory)
Yu Qiao (Shanghai AI Laboratory)
Yali Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shanghai AI Laboratory)
Limin Wang (Nanjing University, Shanghai AI Laboratory)