🤖 AI Summary
Existing RAG methods for long-video understanding suffer from scene fragmentation and contextual discontinuity due to rigid, fixed-length video segmentation. To address this, we propose SceneRAG, a scene-level retrieval-augmented generation framework. Our approach introduces three key innovations: (1) an LLM-driven, narrative-consistency-aware scene segmentation paradigm that dynamically partitions videos by jointly leveraging ASR transcripts and temporal metadata to ensure semantic coherence; (2) a dynamic multimodal scene knowledge graph, fusing visual and textual information, that enables multi-hop reasoning and long-range dependency modeling; and (3) lightweight heuristics with iterative correction to refine scene boundaries and improve segmentation robustness. Evaluated on the 134+ hour LongerVideos benchmark, SceneRAG achieves a win rate of up to 72.5% on generation tasks, significantly outperforming state-of-the-art baselines and empirically validating the importance of scene-level structural modeling for long-video understanding.
📝 Abstract
Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations and dynamically builds a knowledge graph, enabling robust multi-hop retrieval and generation that account for long-range dependencies. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines, achieving a win rate of up to 72.5 percent on generation tasks.
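The segmentation stage described above (an LLM proposing narrative-consistent scene boundaries over ASR transcripts, then lightweight heuristics correcting them) can be illustrated with a minimal sketch. This is not the paper's actual interface: the `propose_boundary` callback stands in for the LLM's narrative-consistency judgment, and the pause-based `min_gap` threshold is a hypothetical example of a boundary-correction heuristic.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    """One ASR segment with its temporal metadata."""
    start: float  # seconds
    end: float
    text: str

def segment_scenes(utterances, propose_boundary, min_gap=1.5):
    """Group consecutive ASR utterances into scenes.

    `propose_boundary(prev_text, next_text)` is a hypothetical stand-in
    for the LLM's decision that the narrative shifts between two
    utterances. As a toy correction heuristic, a proposed boundary is
    only accepted when it coincides with a speech pause of at least
    `min_gap` seconds, so boundaries snap to natural breaks.
    """
    scenes, current = [], [utterances[0]]
    for prev, cur in zip(utterances, utterances[1:]):
        pause = cur.start - prev.end
        if propose_boundary(prev.text, cur.text) and pause >= min_gap:
            scenes.append(current)   # close the current scene
            current = [cur]          # start a new one at the pause
        else:
            current.append(cur)      # keep extending the scene
    scenes.append(current)
    return scenes

# Toy usage with a mock "LLM" that flags explicit topic changes.
utts = [
    Utterance(0.0, 4.0, "intro to the lecture"),
    Utterance(4.2, 8.0, "intro continues"),
    Utterance(10.0, 14.0, "new topic: graphs"),
    Utterance(14.3, 18.0, "graphs continued"),
]
mock_llm = lambda prev, nxt: "new topic" in nxt
scenes = segment_scenes(utts, mock_llm)  # two scenes, split at the 2 s pause
```

In the full system, each resulting scene would then be passed to the multimodal entity-relation extraction step that populates the dynamic knowledge graph.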