VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models are constrained by limited context length, making it challenging to accurately localize sparse yet critical query-relevant segments within long videos. To address this, this work proposes a novel approach that integrates query-segment relevance with temporal dependencies among segments, constructing a visual-temporal affinity graph and introducing an iterative hypothesis-verification-refinement mechanism for global clue localization. This method is the first to jointly leverage external query signals and the intrinsic structural cues of the video, thereby overcoming the limitations of conventional local grounding strategies. Experimental results demonstrate significant performance gains across multiple established benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long.
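The visual-temporal affinity graph described above can be sketched as follows. The paper does not publish the exact combination rule; the product of cosine visual similarity and an exponential temporal-decay term, and the constant `tau`, are assumptions for illustration only.

```python
import numpy as np

def build_affinity_graph(seg_embeddings: np.ndarray, tau: float = 4.0) -> np.ndarray:
    """Build an N x N visual-temporal affinity matrix over video segments.

    Combines cosine visual similarity with a temporal-proximity decay.
    The product form and the decay constant `tau` are illustrative
    assumptions, not VideoDetective's published formulation.
    """
    # Cosine similarity between L2-normalized segment embeddings.
    norm = seg_embeddings / np.linalg.norm(seg_embeddings, axis=1, keepdims=True)
    visual_sim = norm @ norm.T

    # Temporal proximity: segments close in time receive higher edge weight.
    idx = np.arange(len(seg_embeddings))
    temporal = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)

    affinity = visual_sim * temporal
    np.fill_diagonal(affinity, 0.0)  # no self-edges
    return affinity
```

The resulting symmetric matrix can serve as the edge weights over which observed relevance scores are propagated to unseen segments.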

📝 Abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
Problem

Research questions and friction points this paper is trying to address.

long video understanding
multimodal large language models
query-relevant video segments
video intrinsic structure
sparse observation
Innovation

Methods, ideas, or system contributions that make the work stand out.

long video understanding
multimodal large language models
visual-temporal affinity graph
clue localization
relevance propagation
Ruoliu Yang
Nanjing University
Chu Wu
Nanjing University
Caifeng Shan
Philips Research
Computer Vision · Pattern Recognition · Machine Learning · Image/Video Analysis
Ran He
Institute of Automation, Chinese Academy of Sciences
Chaoyou Fu
Nanjing University
Multimodal LLM · LLM · Biometrics