Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video question-answering methods rely solely on local visual cues and struggle with open-domain questions whose answers are dispersed across the web. This work proposes VideoDR, the first benchmark for open-web video deep research, spanning six semantic domains and requiring models to jointly perform cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning. A high-quality dataset is constructed via human annotation to systematically evaluate workflow-based and agent-based paradigms on long-chain retrieval tasks. Experiments reveal that agentic approaches do not consistently outperform workflow-based ones; their effectiveness hinges on a model's ability to stay aligned with the initial video anchors throughout extended retrieval chains. Goal drift and long-horizon consistency emerge as the critical bottlenecks in this setting.

📝 Abstract
In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and verification via multi-hop reasoning. To bridge this gap, we construct VideoDR, the first video deep research benchmark. VideoDR centers on video-conditioned, open-domain question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that the Agentic paradigm is not consistently superior to the Workflow paradigm: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.
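The abstract's contrast between the two paradigms can be sketched in code. The following is a minimal, hypothetical Python sketch, not the benchmark's actual interface: all function names, the toy knowledge base, and the `ANSWER:` convention are illustrative assumptions. The Workflow variant runs a fixed extract-then-retrieve pipeline, while the Agentic variant chains retrieval results into multi-hop queries, which is also where goal drift can occur.

```python
def extract_anchors(frames):
    """Cross-frame visual anchor extraction (illustrative: deduplicate
    per-frame cues while preserving order)."""
    seen, anchors = set(), []
    for cues in frames:
        for cue in cues:
            if cue not in seen:
                seen.add(cue)
                anchors.append(cue)
    return anchors

def web_search(query, kb):
    """Stand-in for interactive web retrieval: look up a toy index."""
    return kb.get(query)

def workflow_answer(frames, kb):
    """Workflow paradigm: one fixed extract -> retrieve -> answer pass.
    With only a single hop, it returns intermediate evidence at best."""
    anchors = extract_anchors(frames)
    evidence = [web_search(a, kb) for a in anchors]
    hits = [e for e in evidence if e is not None]
    return hits[-1] if hits else None

def agentic_answer(frames, kb, max_steps=4):
    """Agentic paradigm: the agent picks the next query from the last
    retrieval result, hop by hop. Goal drift would mean a later query
    losing its connection to the original video anchors."""
    anchors = extract_anchors(frames)
    query = anchors[0]
    for _ in range(max_steps):
        result = web_search(query, kb)
        if result is None:
            return None
        if result.startswith("ANSWER:"):
            return result[len("ANSWER:"):]
        query = result  # follow the retrieval chain to the next hop
    return None

# Toy two-hop example: the video shows a red lighthouse; the answer
# (its region) is only reachable via an intermediate web fact.
kb = {"red lighthouse": "Lindau harbor", "Lindau harbor": "ANSWER:Bavaria"}
frames = [["red lighthouse"], ["red lighthouse"]]
```

On this toy instance, `workflow_answer(frames, kb)` stops at the intermediate evidence `"Lindau harbor"`, while `agentic_answer(frames, kb)` follows the chain to `"Bavaria"`, illustrating why multi-hop retrieval is needed and why staying anchored across hops matters.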
Problem

Research questions and friction points this paper is trying to address.

video question answering
open-web retrieval
multi-hop reasoning
cross-frame reasoning
agentic video reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Deep Research
Agentic Video Reasoning
Open-Web Retrieval
Multi-hop Reasoning
Cross-frame Visual Anchors