Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Long-video understanding is hindered by the limited context windows and inadequate long-term temporal modeling of large vision-language models (LVLMs); existing video retrieval-augmented generation (RAG) methods suffer from broken temporal dependencies and susceptibility to retrieval noise. To address these challenges, the authors propose Vgent, a graph-based retrieval-reasoning-augmented generation framework that constructs a semantic relation graph over video clips to improve cross-clip retrieval, and introduces a structured intermediate reasoning step for noise suppression and explicit aggregation of information across clips. Vgent integrates with open-source LVLMs and is evaluated on three long-video understanding benchmarks, including MLVU, where it achieves absolute gains of 3.0%–5.4% over base models and surpasses state-of-the-art video RAG methods by 8.6%.

📝 Abstract
Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0%–5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.
Problem

Research questions and friction points this paper is trying to address.

Addresses the challenge of processing intensive video tokens beyond the context window limits of LVLMs
Solves disrupted temporal dependencies in video retrieval-augmented generation
Mitigates reasoning limitations of large video language models through structured verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based video representation preserving semantic relationships
Intermediate reasoning step with structured verification
Enhanced retrieval-reasoning framework for long video understanding
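The innovations above can be sketched as a retrieval-reasoning pipeline: build a graph over clip embeddings, expand retrieval along graph edges so semantically related clips are kept together, then verify candidates against the query to suppress noise. The sketch below is a hypothetical, simplified illustration under assumed components (cosine similarity over toy embeddings, a threshold-based verification stand-in), not the paper's actual implementation.

```python
# Hypothetical sketch of a Vgent-style pipeline. Clip embeddings, the
# similarity threshold, and the verification scoring are illustrative
# stand-ins, not the components used in the paper.
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def build_clip_graph(embeddings, threshold=0.8):
    # Connect clips whose embeddings are semantically similar,
    # preserving cross-clip relationships for retrieval.
    n = len(embeddings)
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                edges[i].add(j)
                edges[j].add(i)
    return edges

def retrieve(query_emb, embeddings, edges, top_k=2):
    # Seed retrieval by query similarity, then expand along graph edges
    # so related clips are pulled in even if they score lower on their own.
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine(query_emb, embeddings[i]),
                    reverse=True)
    seeds = ranked[:top_k]
    expanded = set(seeds)
    for s in seeds:
        expanded |= edges[s]
    return expanded

def verify(candidates, query_emb, embeddings, min_score=0.3):
    # Intermediate-reasoning stand-in: drop retrieved clips whose
    # relevance to the query falls below a threshold (noise suppression).
    return {i for i in candidates
            if cosine(query_emb, embeddings[i]) >= min_score}
```

In the actual framework this verification step is performed by the LVLM with structured checks rather than a similarity cutoff; the sketch only shows where each stage sits in the pipeline.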
👥 Authors
Xiaoqian Shen
CS PhD @ KAUST
Generative Models · Vision-Language
Wenxuan Zhang
King Abdullah University of Science and Technology
Jun Chen
King Abdullah University of Science and Technology, Meta AI
Mohamed Elhoseiny
King Abdullah University of Science and Technology