VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing long-form video understanding methods, which suffer from performance degradation and frequent hallucinations because uniform frame sampling fails to capture critical visual evidence. Existing agent-based approaches are further hindered by weak temporal grounding, rigid pipelines, and inefficient reasoning. To overcome these challenges, the authors propose a unified agent-based video reasoning framework that, for the first time, jointly models temporal localization and question answering. The framework incorporates a noise-resistant exploratory masking mechanism for supervised fine-tuning and a reinforcement learning strategy robust to reward hacking. The study also introduces a high-quality grounded question-answering dataset and a benchmark tailored to long videos. Experiments demonstrate that the proposed method significantly improves performance on both long-video understanding and temporal localization, effectively suppressing hallucinations and enhancing grounding accuracy.
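As a rough illustration of the localize-clip-answer pipeline the summary describes, the sketch below shows one way such an agentic loop could be structured. It is a minimal, hypothetical sketch: the `model` and `video` interfaces, the sampling rates, and the confidence-based refinement criterion are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a localize-clip-answer agentic loop (hypothetical:
# the `model` and `video` interfaces below are assumed, not the paper's API).

def localize_clip_answer(model, video, question, max_rounds=3, dense_fps=4):
    """Localize a relevant segment, sample it densely, answer, and refine."""
    # Coarse pass: uniformly sampled frames give the model global context.
    coarse_frames = video.sample_uniform(num_frames=32)
    span = model.localize(coarse_frames, question)  # (start_sec, end_sec)

    answer = None
    for _ in range(max_rounds):
        # Dense pass: on-demand clipping, re-sampling only inside the span.
        clip_frames = video.sample_range(span, fps=dense_fps)
        answer, confident = model.answer(clip_frames, question)
        if confident:
            break
        # The model may revise an inaccurate localization and try again.
        span = model.refine_span(coarse_frames, question, previous=span)
    return answer, span
```

Keeping the coarse frames available during refinement mirrors the framework's claimed ability to revise inaccurate localizations rather than commit to the first predicted span.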

📝 Abstract
In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. In addition, from the data perspective, we develop an effective pipeline to construct high-quality long-video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves strong performance on both long-video understanding and grounding.
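The abstract mentions dedicated rewards that mitigate reward hacking but does not spell them out. The sketch below illustrates one plausible shape for such a reward: gating the answer reward on temporal IoU so the policy cannot earn credit for correct answers produced from ungrounded frames. The IoU gate and the 0.5/0.5 weighting are assumptions for illustration only, not the paper's actual reward design.

```python
# Hypothetical RL reward sketch: the IoU gate and the 0.5/0.5 weighting
# are illustrative assumptions, not the paper's actual reward terms.

def temporal_iou(pred, gold):
    """IoU of two (start_sec, end_sec) spans."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def grounded_reward(pred_span, gold_span, answer_correct, iou_gate=0.3):
    """Combine grounding quality with a gated answer-correctness reward."""
    iou = temporal_iou(pred_span, gold_span)
    # Gate: a correct answer counts only when the predicted span overlaps
    # the annotated evidence, so the policy cannot "hack" the answer
    # reward while ignoring localization.
    answer_reward = float(answer_correct and iou >= iou_gate)
    return 0.5 * iou + 0.5 * answer_reward
```

For example, with `pred_span=(10, 20)`, `gold_span=(12, 22)`, and a correct answer, the IoU is 8/12 ≈ 0.67, the gate passes, and the reward is ≈ 0.83; the same correct answer with a non-overlapping span would earn only the (zero) IoU term.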
Problem

Research questions and friction points this paper is trying to address.

long-video understanding
temporal grounding
video hallucination
agentic thinking-with-videos
localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal grounding
agentic thinking-with-videos
on-demand clipping
unified masking mechanism
reward hacking mitigation