StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

📅 2026-04-25
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing video moment retrieval methods handle action-centric queries well but struggle with narrative elements such as intentions, mental states, and causal logic. This work brings Theory of Mind (ToM) to video moment retrieval and presents StoryTR, which the authors describe as the first moment retrieval benchmark requiring ToM reasoning, built from 8.1k samples of high-information-density narrative short videos. The authors propose an agentic data generation pipeline that produces training data with an explicit three-tier ToM reasoning chain: intent decoding, narrative reasoning, and temporal boundary localization. A 7B-parameter Shorts-Moment model trained on this data achieves a 15.1% relative IoU improvement over baselines, while the much larger Gemini-3.0-Pro reaches only 0.53 Avg IoU on StoryTR. The authors take these results as evidence that explicit narrative reasoning capability matters more than parameter scale.
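To make the three-tier chain concrete, here is a minimal sketch of what one training record could look like. The class names, field names, and sample values are hypothetical and chosen for illustration only; the summary does not describe the actual schema used by the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ToMChain:
    """Hypothetical container for the three reasoning tiers named in the summary."""
    intent_decoding: str            # tier 1: what the character intends or feels
    narrative_reasoning: str        # tier 2: why the moment matters in the story
    boundary: Tuple[float, float]   # tier 3: (start_sec, end_sec) of the moment


@dataclass
class TrainingSample:
    """Hypothetical record pairing a query with its reasoning chain."""
    video_id: str
    query: str
    chain: ToMChain

    def is_valid(self, video_duration: float) -> bool:
        # A boundary must be a non-empty interval inside the video.
        start, end = self.chain.boundary
        return 0.0 <= start < end <= video_duration


# Illustrative values only, loosely based on the "smiling" example in the abstract.
sample = TrainingSample(
    video_id="example_0001",
    query="When does the character hide their hostility?",
    chain=ToMChain(
        intent_decoding="The smile conceals hostility.",
        narrative_reasoning="The character wants to avoid an open conflict.",
        boundary=(12.0, 17.5),
    ),
)
print(sample.is_valid(video_duration=30.0))
```

Keeping the boundary as the last tier mirrors the order given in the summary: the interval is the output of the reasoning, not an input to it.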

📝 Abstract
Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see what is happening but fail to reason why it matters. This semantic gap stems from the lack of Theory of Mind (ToM): the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce StoryTR, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos present an ideal testbed: their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character "smiling" may actually be "concealing hostility." To teach models this reasoning capability, we propose an Agentic Data Pipeline that generates training data with explicit three-tier ToM chains (intent decoding, narrative reasoning, boundary localization). Experiments reveal the severity of the reasoning gap: Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR. However, our 7B Shorts-Moment model, trained on ToM-guided data, improves relative IoU by 15.1% over baselines, demonstrating that narrative reasoning capability matters more than parameter scale.
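For readers unfamiliar with the metric, the sketch below shows the standard temporal IoU between a predicted and a ground-truth time span, its average over a set of queries, and how a relative improvement is computed. It assumes that "Avg IoU" means the mean per-query temporal IoU; the abstract does not define the metric, so the paper's exact protocol may differ. The function names are illustrative.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def average_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Mean temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and the same length")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def relative_improvement(new_score: float, base_score: float) -> float:
    """Relative gain of new_score over base_score, e.g. 0.151 for +15.1%."""
    return (new_score - base_score) / base_score


# Overlap of 5 s over a union of 15 s gives an IoU of one third.
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

Note that a relative improvement is measured against the baseline's own score, so a +15.1% relative gain is not the same as a gain of 15.1 IoU points.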

Problem

Research questions and friction points this paper is trying to address.

video temporal retrieval
narrative understanding
Theory of Mind
semantic gap
multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Theory of Mind
Video Temporal Retrieval
Narrative Reasoning
Agentic Data Pipeline
Multimodal Cues