🤖 AI Summary
In long-video understanding, sparse and temporally scattered evidence frequently induces hallucinations in large multimodal models (LMMs). To address this, we propose LongVT, the first end-to-end multimodal agent framework that exposes the model's inherent temporal localization capability as a callable video cropping tool, enabling an iterative "global browsing → local scrutiny" reasoning chain. Our approach integrates tool-augmented chain-of-thought reasoning, adaptive video resampling, cold-start fine-tuning for tool integration, and a three-stage agentic reinforcement learning pipeline, collectively improving the capture of sparse evidence. LongVT achieves state-of-the-art performance across four long-video benchmarks. Furthermore, we introduce VideoSIAH, a new data suite comprising 247.9K training samples for cold-start fine-tuning and a benchmark of 1,280 human-verified question-answer pairs, which we publicly release to foster research in long-video understanding.
📝 Abstract
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training data consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
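The global-to-local reasoning loop described above can be sketched as a simple control flow: sample frames over the current window, let the model either invoke the crop tool with a narrower time span or emit a grounded answer, and resample within any cropped window. This is a minimal illustrative sketch, not the paper's implementation; the `stub_lmm` function and all names here are hypothetical stand-ins for the actual LMM tool-calling interface.

```python
def sample_frames(start, end, n):
    """Uniformly sample n timestamps (in seconds) from [start, end]."""
    step = (end - start) / n
    return [start + step * (i + 0.5) for i in range(n)]

def stub_lmm(frames, window):
    """Hypothetical stand-in for the LMM. In LongVT, the model itself
    decides whether to call the video cropping tool or answer; here we
    mimic that by cropping until the window is short enough to inspect.
    Returns ("crop", (start, end)) or ("answer", text)."""
    start, end = window
    if end - start > 300:  # still browsing globally -> zoom into the middle half
        quarter = (end - start) / 4
        mid = (start + end) / 2
        return "crop", (mid - quarter, mid + quarter)
    return "answer", f"evidence found in [{start:.0f}s, {end:.0f}s]"

def global_to_local_loop(duration, n_frames=8, max_steps=8):
    """Iterate global browsing -> local scrutiny until an answer is grounded."""
    window = (0.0, duration)
    for _ in range(max_steps):
        frames = sample_frames(*window, n_frames)   # resample finer-grained frames
        action, payload = stub_lmm(frames, window)
        if action == "answer":
            return payload, window
        window = payload                            # crop tool narrowed the clip
    return "no grounded answer", window

answer, window = global_to_local_loop(3600.0)  # a one-hour video
print(answer)
```

The fixed frame budget per step means each crop effectively increases temporal resolution, which is the intuition behind resampling finer-grained frames inside the zoomed-in clip.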