Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of processing long videos with multimodal large language models (MLLMs) under GPU memory constraints, this paper proposes an efficient one-shot retrieval-augmented paradigm, OneClip-RAG. Methodologically, it introduces (1) a query-guided video chunking algorithm that preserves both semantic coherence and knowledge integrity; (2) a unified framework that performs clip chunking and cross-modal retrieval in a single processing step, avoiding redundant computation; and (3) a progressive training strategy, together with the new SynLongVideo dataset, to enhance instruction following. OneClip-RAG plugs into mainstream MLLMs and lifts InternLV2-8B and Qwen2-VL-7B to GPT-4o-level performance on long-video benchmarks such as MLVU. Notably, it enables LLaVA-Video to process an hour of video in under 2.2 minutes on a single RTX 4090 GPU, significantly improving efficiency without sacrificing accuracy.

📝 Abstract
Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting InternLV2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also show its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.
Problem

Research questions and friction points this paper is trying to address.

Enables MLLMs to process long videos efficiently
Improves video understanding via one-shot clip retrieval
Unifies clip chunking and retrieval in one step
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-shot video clip retrieval for augmentation
Query-guided chunking unifies retrieval and processing
Progressive training with synthetic long video dataset
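The one-shot retrieval idea above can be illustrated with a minimal sketch. The code below is a hypothetical simplification (fixed-length chunking, mean-pooled frame embeddings, cosine scoring); the paper's actual query-guided chunking is more sophisticated, and all names and parameters here are illustrative, not the authors' implementation.

```python
import numpy as np

def one_shot_clip_retrieval(frame_embs, query_emb, clip_len=8, top_k=2):
    """Chunk frames into clips and score every clip against the query
    in a single pass, returning the top-k clip indices in temporal order.
    Illustrative sketch only; not the paper's algorithm."""
    clips = [frame_embs[i:i + clip_len]
             for i in range(0, len(frame_embs), clip_len)]
    scores = []
    for clip in clips:
        pooled = clip.mean(axis=0)  # one vector per clip
        cos = pooled @ query_emb / (
            np.linalg.norm(pooled) * np.linalg.norm(query_emb) + 1e-8)
        scores.append(float(cos))
    top = np.argsort(scores)[::-1][:top_k]
    return sorted(top.tolist())  # preserve temporal order for the MLLM
```

Only the selected clips would then be fed to the MLLM, which is why memory stays roughly constant regardless of video length.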
Tao Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Shaobo Ju
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Qiong Wu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Chenxin Fang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Kun Zhang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Jun Peng
PhD, Soochow University, Australian National University
Photovoltaics
Hui Li
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Yiyi Zhou
Xiamen University
deep learning, language and vision
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.