🤖 AI Summary
This work addresses the prevalent issue of scene-level semantic forgetting in existing large video models when processing long-form videos, as well as the absence of effective benchmarks for evaluating long-range visual coherence. To tackle these challenges, the authors introduce SceneBench, the first benchmark specifically designed for scene-level reasoning in long videos, built upon human-perception-aligned "scene" units. They further propose Scene-RAG, a framework that strengthens long-range contextual modeling through cross-scene memory retrieval and dynamic fusion. Experimental results show a significant performance drop for current vision-language models on SceneBench, while Scene-RAG yields a 2.50% absolute accuracy gain, underscoring the critical role of scene-level memory in long video understanding.
📝 Abstract
Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: can current VLMs reason effectively over long, scene-level contexts? To answer it, we introduce SceneBench, a new benchmark designed to pose scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by 2.50% absolute, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
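The core retrieval step the abstract describes can be pictured as ranking stored per-scene summaries by relevance to the question and feeding the top matches back to the model. The sketch below is a toy illustration only: the bag-of-words "embedding", the `retrieve_scenes` function, and the example scene memory are all assumptions standing in for the paper's learned encoder and fusion mechanism.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; Scene-RAG would use a learned
    # multimodal encoder here (this choice is purely illustrative).
    return Counter(text.lower().split())

def cosine(a, b):
    # Standard cosine similarity over sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_scenes(scene_memory, question, k=2):
    # Rank stored scene summaries by similarity to the question
    # and return the top-k as context for the VLM.
    q = embed(question)
    ranked = sorted(scene_memory, key=lambda s: cosine(embed(s), q),
                    reverse=True)
    return ranked[:k]

# Hypothetical per-scene summaries accumulated while watching the video.
scene_memory = [
    "a chef chops vegetables in the kitchen",
    "two friends argue in a parked car",
    "the chef plates the finished dish",
]
context = retrieve_scenes(scene_memory, "what did the chef cook?")
```

In this toy run, the two kitchen scenes score highest for a chef-related question, so the irrelevant car scene is excluded from the retrieved context. The actual fusion of retrieved scenes into the model's input is not shown here.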