Video-BrowseComp: Benchmarking Agentic Video Research on Open Web

📅 2025-12-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing video benchmarks emphasize passive perception and fail to evaluate agent-like capabilities, namely active temporal querying, cross-source evidence integration, and open-web verification. Method: We introduce Video-BrowseComp, the first open-web, agent-oriented video reasoning benchmark, comprising 210 complex questions that require fine-grained visual-temporal reasoning. Video-BrowseComp enforces strict reliance on raw video frames and temporal cues, explicitly excluding textual metadata as a shortcut. It targets high-dynamics, low-metadata domains (e.g., sports, gaming) and incorporates timestamp localization, multi-hop evidence linking, and cross-modal claim verification, supporting joint evaluation of LLMs, web search, and video understanding modules. Contribution/Results: Experiments reveal severe limitations in current state-of-the-art systems (e.g., GPT-5.1 with search achieves only 15.24% accuracy), showing that vision-intensive agent capabilities are the critical bottleneck in autonomous video reasoning, a gap that prior benchmarks did not address.

📝 Abstract

The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
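
As a quick sanity check on the headline number, the sketch below converts the reported 15.24% back into a question count. It assumes accuracy is simply the number of correct answers divided by all 210 questions, which the abstract implies but does not state; `correct_count` is a hypothetical helper, not part of any released evaluation code.

```python
# Hypothetical helper: convert a reported accuracy (in percent) back into
# the nearest whole number of correctly answered questions.
def correct_count(accuracy_pct: float, total: int) -> int:
    return round(accuracy_pct / 100 * total)


TOTAL_QUESTIONS = 210      # benchmark size stated in the abstract
REPORTED_ACCURACY = 15.24  # GPT-5.1 (w/ Search), as reported in the abstract

# Assumption: accuracy = correct / total over the full benchmark.
n_correct = correct_count(REPORTED_ACCURACY, TOTAL_QUESTIONS)

# Recompute the percentage from the integer count to confirm it round-trips.
recomputed = round(100 * n_correct / TOTAL_QUESTIONS, 2)

print(n_correct, recomputed)
```

Under that assumption, the reported figure corresponds to 32 of the 210 questions answered correctly.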

Problem

Research questions and friction points this paper is trying to address.

Benchmarks agentic video research requiring active web interrogation

Evaluates temporal visual evidence navigation for claim verification

Addresses the modality gap between text-based agents and dynamic video processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for agentic video reasoning on open web

Mandatory temporal visual evidence dependency for answers

Evaluates proactive video timeline navigation and verification

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding