SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RAG methods for long-video understanding suffer from scene fragmentation and contextual discontinuity due to fixed, rigid video segmentation. To address this, we propose SceneRAG, a scene-level retrieval-augmented generation framework. Our approach introduces three key innovations: (1) an LLM-driven, narrative-consistency-aware scene segmentation paradigm that dynamically partitions videos by jointly leveraging ASR transcripts and temporal metadata to ensure semantic coherence; (2) a vision-language multimodal dynamic scene knowledge graph enabling multi-hop reasoning and long-range dependency modeling; and (3) a lightweight heuristic iterative correction mechanism to enhance segmentation robustness. Evaluated on the 134+ hour LongerVideos benchmark, SceneRAG achieves a win rate of up to 72.5% on generation tasks, significantly outperforming state-of-the-art baselines, and empirically validates the critical importance of scene-level structural modeling for long-video understanding.

📝 Abstract
Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations and dynamically builds a knowledge graph, enabling robust multi-hop retrieval and generation that account for long-range dependencies. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines, achieving a win rate of up to 72.5 percent on generation tasks.
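The segmentation pipeline described in the abstract, an LLM proposing narrative-consistent scene boundaries over ASR segments, followed by lightweight heuristic correction, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the per-segment scene ids are assumed to come from an upstream LLM call (not shown), and the `min_duration` threshold is a hypothetical heuristic.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    text: str      # ASR transcript for this chunk

def propose_boundaries(segments, llm_scene_ids):
    """Group consecutive ASR segments into scenes, given per-segment
    scene ids (in SceneRAG these would come from an LLM prompted with
    the transcript plus timestamps; here they are simply an input)."""
    scenes, current, current_id = [], [], None
    for seg, sid in zip(segments, llm_scene_ids):
        if sid != current_id and current:
            scenes.append(current)
            current = []
        current.append(seg)
        current_id = sid
    if current:
        scenes.append(current)
    return scenes

def merge_short_scenes(scenes, min_duration=10.0):
    """Lightweight heuristic correction (hypothetical rule): fold any
    scene shorter than min_duration seconds into the preceding scene."""
    merged = []
    for scene in scenes:
        duration = scene[-1].end - scene[0].start
        if merged and duration < min_duration:
            merged[-1].extend(scene)
        else:
            merged.append(scene)
    return merged
```

The two-stage structure mirrors the abstract's description: coarse LLM-driven boundaries first, then cheap heuristics to repair obviously fragmented scenes.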
Problem

Research questions and friction points this paper is trying to address.

Effective understanding of long-form video content remains challenging
Current RAG methods disrupt contextual continuity in videos
Lack of dynamic scene-level knowledge fusion for robust video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segments videos into narrative-consistent scenes
Fuses visual and textual modalities for knowledge graphs
Enables robust multi-hop retrieval and generation
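The knowledge-graph side of these contributions can be illustrated with a toy sketch. Nothing here is from the paper's code: entity and relation extraction from the visual and textual modalities is assumed to happen upstream, and a breadth-first walk stands in for multi-hop retrieval over long-range scene dependencies.

```python
from collections import defaultdict, deque

class SceneKnowledgeGraph:
    """Toy scene-level knowledge graph: nodes are entities, and each
    edge carries the relation and the scene it was extracted from."""

    def __init__(self):
        # entity -> list of (neighbor, relation, scene_id)
        self.edges = defaultdict(list)

    def add_relation(self, head, relation, tail, scene_id):
        # Store the edge in both directions for undirected traversal.
        self.edges[head].append((tail, relation, scene_id))
        self.edges[tail].append((head, relation, scene_id))

    def multi_hop_scenes(self, entity, hops=2):
        """Collect scene ids reachable within `hops` edges of `entity`,
        approximating multi-hop retrieval across distant scenes."""
        seen, scenes = {entity}, set()
        frontier = deque([(entity, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == hops:
                continue
            for neighbor, _relation, scene_id in self.edges[node]:
                scenes.add(scene_id)
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
        return scenes
```

Widening `hops` trades retrieval precision for coverage of long-range dependencies, which is the design tension scene-level graphs are meant to manage.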
Nianbo Zeng
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; College of Computer Science and Software Engineering, Shenzhen University, China
Haowen Hou
Assistant Professor, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
RWKV, LLM, VLM, Information Retrieval
Fei Richard Yu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; College of Computer Science and Software Engineering, Shenzhen University, China
Si Shi
Macao Polytechnic University
Financial AI, Educational AI, Deep Learning
Ying Tiffany He
College of Computer Science and Software Engineering, Shenzhen University, China