RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding

πŸ“… 2025-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing long-video understanding benchmarks (e.g., Video-MME, MLVU) rely on uniform frame sampling, which often discards semantically critical frames and compromises the evaluation accuracy of multimodal large language models (MLLMs). To address this, the authors propose RAG-Adapter, a plug-and-play framework that retrieves and samples the frames most relevant to a given question, enabling question-aware, adaptive frame selection. They additionally introduce Grouped-supervised Contrastive Learning (GCL) to improve sampling quality via fine-tuning on their newly constructed MMAT dataset. Evaluated on Video-MME and other benchmarks, RAG-Adapter sampling consistently outperforms uniform sampling, raising GPT-4o's accuracy on Video-MME by 9.3 percent and providing a more accurate, reproducible way to assess long-video understanding.

πŸ“ Abstract
Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To effectively assess their video comprehension capabilities, long video understanding benchmarks, such as Video-MME and MLVU, are proposed. However, these benchmarks directly use uniform frame sampling for testing, which results in significant information loss and affects the accuracy of the evaluations in reflecting the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method to further enhance sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks, finding that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., Accuracy of GPT-4o increases by 9.3 percent on Video-MME), providing a more accurate testing method for long video benchmarks.
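The paper does not specify the exact form of the Grouped-supervised Contrastive Learning (GCL) objective here. As a rough, hypothetical illustration of what a grouped-supervised contrastive loss looks like (a SupCon-style objective where embeddings sharing a group label are pulled together and others pushed apart), one could sketch it as follows; the function name, temperature, and toy data are all assumptions, not the paper's implementation:

```python
import math

def grouped_contrastive_loss(embeddings, groups, temperature=0.1):
    """Average InfoNCE-style loss over all (anchor, same-group positive) pairs.

    embeddings: list of L2-normalized vectors; groups: one label per vector.
    """
    def sim(a, b):
        # Scaled dot product; vectors are assumed L2-normalized.
        return sum(x * y for x, y in zip(a, b)) / temperature

    total, pairs = 0.0, 0
    for i, anchor in enumerate(embeddings):
        positives = [j for j in range(len(embeddings))
                     if j != i and groups[j] == groups[i]]
        if not positives:
            continue  # anchors with no same-group partner contribute nothing
        denom = sum(math.exp(sim(anchor, embeddings[j]))
                    for j in range(len(embeddings)) if j != i)
        for p in positives:
            total -= math.log(math.exp(sim(anchor, embeddings[p])) / denom)
            pairs += 1
    return total / pairs

# Toy check: the loss is low when group labels match the embedding clusters,
# and high when labels are shuffled across clusters.
vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(grouped_contrastive_loss(vecs, [0, 0, 1, 1]))  # small
print(grouped_contrastive_loss(vecs, [0, 1, 0, 1]))  # large
```

The toy check makes the intended behavior concrete: consistent group labels yield a near-zero loss, while labels that cut across the embedding clusters yield a much larger one.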
Problem

Research questions and friction points this paper is trying to address.

Uniform frame sampling in long-video benchmarks (e.g., Video-MME, MLVU) discards question-relevant frames, causing significant information loss.
This information loss skews benchmark results, so evaluations understate the true video-understanding abilities of MLLMs.
Long-video evaluation lacks a question-aware frame sampling method that preserves the information needed to answer each test question.
Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG-Adapter reduces information loss by retrieving the frames most relevant to each question instead of sampling uniformly.
Grouped-supervised Contrastive Learning (GCL), fine-tuned on the constructed MMAT dataset, further improves sampling quality.
RAG-Adapter sampling consistently outperforms uniform sampling across benchmarks (e.g., +9.3 percent accuracy for GPT-4o on Video-MME).
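The core idea of question-aware frame selection can be sketched as top-k retrieval over frame and question embeddings, contrasted with the uniform baseline the paper critiques. The helper names and the toy 2-d embeddings below are assumptions for illustration, not the paper's actual retrieval pipeline:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def uniform_sample(num_frames, k):
    """Pick k evenly spaced frame indices (the uniform-sampling baseline)."""
    step = num_frames / k
    return [int(i * step) for i in range(k)]

def relevance_sample(frame_embeddings, question_embedding, k):
    """Pick the k frames whose embeddings best match the question embedding."""
    ranked = sorted(
        range(len(frame_embeddings)),
        key=lambda i: cosine(frame_embeddings[i], question_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])  # restore temporal order before feeding the MLLM

# Toy example: 8 frames with 2-d embeddings; only frames 5 and 6 match
# the question, and uniform sampling misses both of them.
frames = [[1.0, 0.0]] * 5 + [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
question = [0.0, 1.0]
print(uniform_sample(8, 2))                    # [0, 4]
print(relevance_sample(frames, question, 2))   # [5, 6]
```

The toy example shows the failure mode the paper targets: uniform sampling picks frames 0 and 4 and misses the two question-relevant frames entirely, while relevance-based selection returns exactly frames 5 and 6.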
πŸ”Ž Similar Papers
X
Xichen Tan
College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Yunfan Ye
National University of Defense Technology
Low-level Vision · Computer Graphics · Edge Detection
Yuanjing Luo
College of Computer and Mathematics, Central South University of Forestry and Technology, Changsha, China
Qian Wan
Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan, China
Fang Liu
School of Design, Hunan University, Changsha, China
Zhiping Cai
College of Computer Science and Technology, National University of Defense Technology, Changsha, China