🤖 AI Summary
This work addresses a limitation of existing long-form video question answering benchmarks, which predominantly rely on local cues and fail to evaluate models' capacity for deep narrative reasoning, such as tracking character intentions, linking distant events, and reconstructing causal chains across entire films. To this end, we introduce NA-VQA, a new benchmark comprising 88 full-length movies and 4.4K open-ended questions that require cross-scene information integration. We further propose Video-NaRA, a framework that constructs event-level narrative chains and stores them in structured memory to support long-range reasoning. By incorporating multi-span evidence annotation and a narrative-centric inference mechanism, Video-NaRA moves beyond shallow matching paradigms. Experimental results show that our approach outperforms current methods on NA-VQA, achieving up to a 3% gain on questions involving distant evidence and substantially enhancing comprehension of complex narrative structures.
📝 Abstract
Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
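The abstract describes Video-NaRA only at a high level: it builds event-level narrative chains and stores them in a structured memory that is queried during reasoning. As a rough, hypothetical sketch of that idea (all class and method names here are ours, not the paper's; the paper's actual retrieval mechanism is not specified), an event-level memory with entity-overlap retrieval might look like:

```python
from dataclasses import dataclass

@dataclass
class Event:
    scene_id: int      # temporal position in the film
    description: str   # short natural-language summary of the event
    entities: set      # characters/objects involved in the event

class NarrativeMemory:
    """Minimal sketch of an event-level narrative memory (hypothetical API)."""

    def __init__(self):
        self.events = []

    def add_event(self, scene_id, description, entities):
        self.events.append(Event(scene_id, description, set(entities)))

    def retrieve(self, query_entities, k=3):
        # Rank stored events by entity overlap with the question.
        scored = sorted(self.events,
                        key=lambda e: len(e.entities & set(query_entities)),
                        reverse=True)
        return scored[:k]

    def chain(self, entity):
        # All events involving an entity, in temporal (scene) order:
        # a narrative chain that can link distant parts of the movie.
        return sorted((e for e in self.events if entity in e.entities),
                      key=lambda e: e.scene_id)

# Toy example: three events dispersed across a film.
mem = NarrativeMemory()
mem.add_event(1, "Ann hides the letter", ["Ann", "letter"])
mem.add_event(40, "Ben finds the letter", ["Ben", "letter"])
mem.add_event(75, "Ann confronts Ben", ["Ann", "Ben"])

letter_chain = mem.chain("letter")  # events at scenes 1 and 40
```

In this toy setup, `chain("letter")` connects scenes 1 and 40, the kind of far-apart evidence pair that NA-VQA's Far-labeled evidence spans are designed to probe.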