RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large multimodal models (LMMs) struggle to comprehend minute- to hour-long videos because they lack explicit long-term memory and structured reasoning mechanisms. Method: The paper proposes a retrieval–reasoning framework built on a spatio-temporal graph. Its core innovation is constructing an explicit spatio-temporal graph as a retrievable long-term memory, and coupling sparse frame retrieval (5–10 frames) with multi-step compositional reasoning over that graph to enable cross-frame object tracking and multi-hop question answering. The method includes query decomposition, stepwise execution on the graph, and multimodal feature alignment. Results: The approach outperforms state-of-the-art methods on NExT-QA and EgoSchema, with notable gains on multi-hop reasoning and long-range tracking while incurring minimal retrieval overhead, thereby alleviating key memory and modeling bottlenecks in LMMs.

📝 Abstract
Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process even minute- to hour-long videos due to their lack of explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video, capturing both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph, retrieving relevant key information. Our approach enables more accurate understanding of long videos, particularly for queries that require multi-hop reasoning and tracking objects across frames. Our approach demonstrates superior performance with a limited number of retrieved frames (5–10) compared with other SOTA methods and baselines on two major video QA datasets, NExT-QA and EgoSchema.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video understanding with retrieval-augmented compositional reasoning
Addressing LMMs' inability to process long videos without memory mechanisms
Improving multi-hop reasoning and object tracking in lengthy videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval augmented video understanding framework
Compositional reasoning over spatio-temporal graph
Graph representation as long-term memory
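The contributions above can be illustrated with a toy sketch: a spatio-temporal graph acting as long-term memory, and a decomposed query executed step by step over it to retrieve a sparse set of frames. All class, function, and relation names here are illustrative assumptions; the paper does not publish this API.

```python
# Hypothetical sketch of RAVU-style compositional reasoning over a
# spatio-temporal graph. Names and data structures are assumptions,
# not the authors' implementation.
from collections import defaultdict

class SpatioTemporalGraph:
    """Long-term memory: nodes are (frame, entity); edges carry relations."""
    def __init__(self):
        self.nodes = set()              # (frame_idx, entity) pairs
        self.edges = defaultdict(list)  # node -> [(relation, node)]

    def add_observation(self, frame_idx, entity, relation=None, target=None):
        node = (frame_idx, entity)
        self.nodes.add(node)
        if relation and target:
            tgt = (frame_idx, target)
            self.nodes.add(tgt)
            self.edges[node].append((relation, tgt))

    def track(self, entity):
        """Cross-frame tracking: all frames in which an entity appears."""
        return sorted(f for f, e in self.nodes if e == entity)

    def neighbors(self, node, relation):
        return [tgt for rel, tgt in self.edges[node] if rel == relation]

def answer(graph, steps):
    """Execute a decomposed query: each step narrows the candidate frames.
    Returns a sparse frame set (here, at most 10) to pass to the LMM."""
    frames = None
    for step in steps:
        if step[0] == "find":          # ("find", entity)
            hits = set(graph.track(step[1]))
        elif step[0] == "relate":      # ("relate", entity, relation, target)
            _, ent, rel, tgt = step
            hits = {f for f in graph.track(ent)
                    if (f, tgt) in graph.neighbors((f, ent), rel)}
        frames = hits if frames is None else frames & hits
    return sorted(frames)[:10]

# Toy video: a dog appears in frames 2-5 and holds a ball in frames 3-4.
g = SpatioTemporalGraph()
for f in [2, 3, 4, 5]:
    g.add_observation(f, "dog")
for f in [3, 4]:
    g.add_observation(f, "dog", relation="holds", target="ball")

# Multi-hop query: "When does the dog hold the ball?"
steps = [("find", "dog"), ("relate", "dog", "holds", "ball")]
print(answer(g, steps))  # prints [3, 4]
```

The intersection of per-step hits mimics multi-hop filtering: each reasoning step prunes the candidate frames, so only a handful need to be retrieved and shown to the LMM.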
Sameer Malik
Fujitsu Research of India Private Limited
Moyuru Yamada
Fujitsu Research of India Private Limited
Ayush Singh
Cigna, Northeastern University, Boston Children's Hospital, Harvard Medical School
Machine Learning, Deep Learning, Computer Vision, Natural Language Processing, BioInformatics
Dishank Aggarwal
Fujitsu Research of India Private Limited