🤖 AI Summary
In long-video understanding, sparse and temporally scattered evidence frequently induces hallucinations in large multimodal models (LMMs). To address this, we propose LongVT, the first end-to-end multimodal agent framework that exposes the model's inherent temporal localization capability as a callable video cropping tool, enabling an iterative "global browsing → local scrutiny" reasoning chain. Our approach integrates tool-augmented chain-of-thought reasoning, adaptive video resampling, cold-start fine-tuning for tool integration, and a three-stage agentic reinforcement learning pipeline, collectively improving the capture of sparse evidence. LongVT achieves state-of-the-art performance across four long-video benchmarks. Furthermore, we introduce VideoSIAH, a new data suite comprising 247.9K training samples for cold-start fine-tuning and a benchmark of 1,280 human-verified question-answer pairs, which we publicly release to foster research in long-video understanding.
📝 Abstract
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training data consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
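The global-to-local reasoning loop described above can be sketched as a simple control flow: sample frames over the current window, let the model either invoke the crop tool with a narrower time span or emit a grounded answer, and resample within any cropped window. This is a minimal illustrative sketch, not the paper's implementation; the `stub_lmm` function and all names here are hypothetical stand-ins for the actual LMM tool-calling interface.

```python
def sample_frames(start, end, n):
    """Uniformly sample n timestamps (in seconds) from [start, end]."""
    step = (end - start) / n
    return [start + step * (i + 0.5) for i in range(n)]

def stub_lmm(frames, window):
    """Hypothetical stand-in for the LMM. In LongVT, the model itself
    decides whether to call the video cropping tool or answer; here we
    mimic that by cropping until the window is short enough to inspect.
    Returns ("crop", (start, end)) or ("answer", text)."""
    start, end = window
    if end - start > 300:  # still browsing globally -> zoom into the middle half
        quarter = (end - start) / 4
        mid = (start + end) / 2
        return "crop", (mid - quarter, mid + quarter)
    return "answer", f"evidence found in [{start:.0f}s, {end:.0f}s]"

def global_to_local_loop(duration, n_frames=8, max_steps=8):
    """Iterate global browsing -> local scrutiny until an answer is grounded."""
    window = (0.0, duration)
    for _ in range(max_steps):
        frames = sample_frames(*window, n_frames)   # resample finer-grained frames
        action, payload = stub_lmm(frames, window)
        if action == "answer":
            return payload, window
        window = payload                            # crop tool narrowed the clip
    return "no grounded answer", window

answer, window = global_to_local_loop(3600.0)  # a one-hour video
print(answer)
```

The fixed frame budget per step means each crop effectively increases temporal resolution, which is the intuition behind resampling finer-grained frames inside the zoomed-in clip.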