VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video large language models rely on static frame sampling, which struggles to effectively capture temporal dynamics in long videos, leading to weak localization capabilities and information loss. This work proposes a method to construct high-quality training data without requiring genuine long-video understanding: leveraging a strong language model to generate multi-step tool-interaction trajectories in the video description space, then aligning these with videos through video-caption alignment, spatiotemporal scaling, and synthetic trajectory backfilling to build a large-scale tool-reasoning dataset. The resulting video large language model significantly outperforms both caption-only language model agents and strong video baselines across multiple long-video benchmarks, demonstrating the effectiveness of tool-augmented synthetic data in enhancing dynamic reasoning and temporal comprehension.

📝 Abstract
Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool-interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool-use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool-reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use. Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool-augmented synthetic data and adaptive retrieval-and-zoom reasoning for long-form video understanding.
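The grounding step described above (replacing caption segments in an LLM-generated tool trajectory with the corresponding video frames) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ToolStep` record, `ground_trajectory` function, and the fixed-fps frame-index mapping are all assumptions made for the example.

```python
# Hedged sketch of caption-to-frame trajectory grounding.
# All names and structures here are illustrative, not from the paper.

from dataclasses import dataclass


@dataclass
class ToolStep:
    tool: str       # e.g. "temporal_retrieval", "spatial_zoom", "temporal_zoom"
    start_s: float  # segment start in seconds, taken from the caption index
    end_s: float    # segment end in seconds
    caption: str    # text the agentic LLM reasoned over in caption space


def ground_trajectory(steps, fps=1.0):
    """Replace each caption segment with the frame indices it covers,
    yielding an interleaved frame / tool-reasoning record."""
    grounded = []
    for step in steps:
        first = int(step.start_s * fps)
        last = int(step.end_s * fps)
        grounded.append({
            "tool": step.tool,
            # In the real pipeline these indices would select actual frames;
            # here they simply stand in for the visual content.
            "frames": list(range(first, last + 1)),
        })
    return grounded


traj = [
    ToolStep("temporal_retrieval", 10.0, 12.0, "a person opens the door"),
    ToolStep("temporal_zoom", 11.0, 11.5, "close-up on the door handle"),
]
grounded = ground_trajectory(traj, fps=2.0)
```

At 2 fps, the retrieval step over 10–12 s maps to frames 20 through 24, and the zoom over 11–11.5 s maps to frames 22 and 23, so the text-only trajectory becomes a frame-interleaved training example.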
Problem

Research questions and friction points this paper is trying to address.

long-form video understanding
Video Large Language Models
agentic tools
temporal localization
information loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic VideoLLM
synthetic tool interaction
adaptive temporal exploration
long-form video understanding
tool-augmented reasoning