VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

📅 2026-04-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing long-form video understanding methods, which are constrained by context length and often overlook the implicit alignment between a video’s spatiotemporal structure and the user’s query intent in retrieval-augmented generation. To overcome these challenges, we propose VideoStir, a novel framework that models videos as spatiotemporal graphs and integrates multi-hop retrieval with a multimodal large language model (MLLM)-based intent-aware relevance scorer. This enables structured, intent-sensitive comprehension of long videos without relying on auxiliary annotations. Our approach surpasses conventional flat semantic matching paradigms and achieves state-of-the-art performance under fully unsupervised conditions. To validate the efficacy of our design, we further introduce IR-600K, a large-scale dataset explicitly aligned with diverse user intents.
πŸ“ Abstract
Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) offers a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a clip-level spatio-temporal graph, then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.
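To make the pipeline in the abstract concrete, here is a minimal sketch of its three stages: building a clip-level graph, expanding from seed clips via multi-hop retrieval, and ranking candidates with an intent scorer. Everything here is a stand-in assumption, not the paper's method: captions substitute for visual features, word overlap substitutes for the MLLM-backed intent-relevance scorer, and the temporal/semantic edge heuristics are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    cid: int
    caption: str  # stand-in for the clip's visual/semantic features

def build_graph(clips):
    """Hypothetical graph construction: temporal edges link adjacent
    clips; semantic edges link clips whose captions share a word."""
    adj = {c.cid: set() for c in clips}
    for a, b in zip(clips, clips[1:]):  # temporal edges
        adj[a.cid].add(b.cid); adj[b.cid].add(a.cid)
    for a in clips:                     # crude semantic edges
        for b in clips:
            if a.cid < b.cid and set(a.caption.split()) & set(b.caption.split()):
                adj[a.cid].add(b.cid); adj[b.cid].add(a.cid)
    return adj

def intent_score(query, clip):
    # Placeholder for the MLLM-backed intent-relevance scorer:
    # here, just word overlap between query and caption.
    return len(set(query.split()) & set(clip.caption.split()))

def multi_hop_retrieve(query, clips, adj, hops=2, top_k=3):
    """Seed with the best-scoring clip, expand `hops` steps along
    graph edges, then rank all visited clips by intent score."""
    by_id = {c.cid: c for c in clips}
    seed = max(clips, key=lambda c: intent_score(query, c))
    frontier, visited = {seed.cid}, {seed.cid}
    for _ in range(hops):  # aggregate evidence across related events
        frontier = {n for cid in frontier for n in adj[cid]} - visited
        visited |= frontier
    ranked = sorted(visited, key=lambda cid: intent_score(query, by_id[cid]),
                    reverse=True)
    return ranked[:top_k]
```

For example, a query about keys would seed on the clip mentioning keys and then pull in temporally and semantically adjacent clips as supporting context, which is the intuition (not the implementation) behind the multi-hop step.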
Problem

Research questions and friction points this paper aims to address.

long video understanding
retrieval-augmented generation
spatio-temporal structure
intent-aware reasoning
multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatio-temporal graph
intent-aware retrieval
multi-hop reasoning
retrieval-augmented generation
multimodal large language models