NarrativeTrack: Evaluating Video Language Models Beyond the Frame

πŸ“… 2026-01-03
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the notable limitations of existing video-language models in capturing the temporal continuity and narrative evolution of entities within dynamic visual scenes. To this end, we introduce NarrativeTrack, the first systematic benchmark for evaluating video narrative understanding. It employs an automated entity-centric pipeline and a Compositional Reasoning Progression (CRP) mechanism to assess models’ fine-grained narrative reasoning capabilities across three progressive dimensions: existence, change, and ambiguity. Our experiments reveal that while general-purpose open-source models exhibit strong perceptual abilities, they struggle with temporal modeling; conversely, video-specialized models demonstrate contextual awareness but are prone to entity hallucination. These findings underscore fundamental shortcomings in current approaches regarding cross-frame entity consistency and contextual narrative evolution.

πŸ“ Abstract
Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, while video-specific MLLMs capture temporal context yet hallucinate entity contexts. These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.
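The CRP described above scores models separately at each of its three progressive levels (existence, change, ambiguity). As a rough illustration only, assuming a simple per-question correctness log (the function and data names here are hypothetical, not the paper's actual pipeline), tiered accuracy could be tallied like this:

```python
# Hypothetical sketch of CRP-style tiered scoring; names are illustrative,
# not taken from the NarrativeTrack implementation.
from collections import defaultdict

# The three progressive CRP dimensions, in order of narrative complexity.
CRP_LEVELS = ["existence", "change", "ambiguity"]

def crp_accuracy(results):
    """results: iterable of (level, is_correct) pairs.
    Returns per-level accuracy for levels that have at least one question."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, ok in results:
        total[level] += 1
        correct[level] += int(ok)
    return {lvl: correct[lvl] / total[lvl] for lvl in CRP_LEVELS if total[lvl]}

# Toy example: a model that handles existence well but degrades as
# narrative complexity increases.
results = [("existence", True), ("existence", True),
           ("change", True), ("change", False),
           ("ambiguity", False)]
print(crp_accuracy(results))  # {'existence': 1.0, 'change': 0.5, 'ambiguity': 0.0}
```

Reporting accuracy per level rather than as one aggregate is what lets a benchmark like this separate perceptual grounding (existence) from temporal and contextual reasoning (change, ambiguity).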
Problem

Research questions and friction points this paper is trying to address.

narrative understanding
video language models
temporal reasoning
entity tracking
multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Narrative Understanding
Entity-Centric Reasoning
Temporal Grounding
Compositional Reasoning Progression
Multimodal Large Language Models
πŸ”Ž Similar Papers
2024-06-09 · Annual Meeting of the Association for Computational Linguistics · Citations: 13