VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of existing video agents in long-form video understanding, which typically rely on dense frame sampling and incur substantial computational overhead. The authors propose an active evidence-seeking mechanism grounded in video logic flow, integrating tool-guided multi-granularity observation within a think-act-observe loop to enable query-aware, efficient localization of critical evidence. This approach is the first to combine video logic flow with tool-guided exploration, significantly reducing the number of sampled frames while enhancing reasoning capability. Experimental results demonstrate consistent superiority over current methods across four video understanding benchmarks; notably, the model achieves a 10.2 absolute-point improvement over its GPT-5 base model on LVBench while using 93% fewer frames.

📝 Abstract
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute-point improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
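The think-act-observe loop described in the abstract can be sketched as a simple agent that first takes a coarse pass over the video, then drills in only where the accumulated observations suggest the answer lives. Everything below is a hypothetical illustration, not the paper's implementation: the tool names (`coarse_scan`, `fine_peek`), the caption-matching "think" policy, and the toy 60-second video are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """One piece of evidence gathered by a tool: a timestamp plus a caption."""
    timestamp: float
    caption: str

# Toy stand-in for a 60-second video: sparse per-frame captions.
CAPTIONS = {0: "an empty street", 20: "pedestrians crossing", 40: "a red car parks", 59: "night falls"}

def coarse_scan(num_frames=4):
    """Multi-granularity tool (hypothetical): sample a few frames evenly."""
    return [Observation(t, CAPTIONS[t]) for t in sorted(CAPTIONS)[:num_frames]]

def fine_peek(around):
    """Multi-granularity tool (hypothetical): look closely near one timestamp."""
    return [Observation(around + 1, f"detail near {around}s")]

class VideoSeekAgent:
    """Minimal think-act-observe loop: seek evidence instead of dense parsing."""

    def __init__(self, tools, max_steps=8):
        self.tools = tools            # tool name -> callable returning Observations
        self.max_steps = max_steps
        self.observations = []        # accumulated evidence across steps

    def think(self, query):
        """Placeholder policy: scan coarsely first, answer once evidence matches,
        otherwise peek at finer granularity around the latest observation."""
        if not self.observations:
            return "coarse_scan", {"num_frames": 4}
        hit = next((o for o in self.observations if query in o.caption), None)
        if hit is not None:
            return "answer", {"timestamp": hit.timestamp}
        return "fine_peek", {"around": self.observations[-1].timestamp}

    def run(self, query):
        for _ in range(self.max_steps):       # think ...
            action, args = self.think(query)
            if action == "answer":
                return args
            self.observations.extend(self.tools[action](**args))  # act + observe
        return None
```

Under this toy policy, the query "red car" is resolved after a single 4-frame coarse scan, which is the frame-efficiency argument in miniature: a dense sampler would caption every frame, while the seeking agent stops as soon as the evidence suffices.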
Problem

Research questions and friction points this paper is trying to address.

video agent
long-horizon video understanding
computational cost
frame efficiency
video-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

video logic flow
tool-guided seeking
long-horizon video agent
query-aware exploration
multi-granular observation