VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle with long-video understanding due to inadequate contextual modeling, primarily caused by neglecting the intrinsic shot structure of videos—leading proxy-based approaches to incorporate redundant or noisy temporal segments. To address this, we propose a *chain-of-shot reasoning* paradigm that formalizes long-video understanding as a shot-level causal inference chain. Our method emulates human-like deep interactive comprehension via progressive shot filtering and coarse-to-fine multimodal reasoning. It integrates shot detection, hierarchical attention, and dynamic context retrieval to build a lightweight MLLM-based video agent framework. Evaluated on VideoMME and EgoSchema, our approach achieves 77.0 and 70.1, respectively—outperforming InternVL2.5-8B by up to 10.8%. Remarkably, it attains performance comparable to GPT-4o and Gemini 1.5 Pro using only 7% of the frames and 12% of the inference latency.

📝 Abstract
The recent advance in video understanding has been driven by multimodal large language models (MLLMs). However, these MLLMs are good at analyzing short videos while struggling to understand videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents to retrieve extra contextual knowledge from a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots; i.e., to answer a user question about a long video, it is critical to deeply understand the relevant shots, as a human would. Without such insight, these agents often mistakenly retrieve redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Different from previous works, VideoChat-A1 can deeply think with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it progressively selects the shots relevant to the user question and examines them in a coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 effectively mimics the step-by-step human thinking process, allowing it to interactively discover preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that VideoChat-A1 achieves state-of-the-art performance on mainstream long video QA benchmarks, e.g., 77.0 on VideoMME and 70.1 on EgoSchema, outperforming its strong baselines (e.g., InternVL2.5-8B and InternVideo2.5-8B) by up to 10.8% and 6.2%, respectively. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy with only 7% of the input frames and 12% of the inference time on average.
Problem

Research questions and friction points this paper is trying to address.

Enhancing long video understanding via shot-based reasoning
Reducing redundant temporal context in video analysis
Mimicking human step-by-step thinking for video QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-shot reasoning for long videos
Coarse-to-fine shot partition analysis
Multi-modal reasoning mimicking human thinking
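The chain-of-shot idea above can be sketched as a simple select-then-refine loop: score each shot for relevance to the question, keep the best candidates, partition them more finely, and repeat. The sketch below is purely illustrative; `Shot`, `split`, `chain_of_shot`, and `mock_score` are hypothetical names, and the scorer stands in for the MLLM-based relevance judgment used by the actual agent.

```python
# Illustrative sketch of chain-of-shot reasoning (hypothetical names throughout;
# the paper's agent uses an MLLM to score shot relevance and answer the question).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Shot:
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)

def split(shot: Shot) -> List[Shot]:
    """Coarse-to-fine: partition a shot into two finer sub-shots."""
    mid = (shot.start + shot.end) // 2
    if mid == shot.start:
        return [shot]  # too short to split further
    return [Shot(shot.start, mid), Shot(mid, shot.end)]

def chain_of_shot(shots: List[Shot], score: Callable[[Shot], float],
                  keep: int = 2, rounds: int = 3) -> List[Shot]:
    """Progressively keep the most question-relevant shots, refining each round."""
    current = list(shots)
    for _ in range(rounds):
        # keep the shots the relevance scorer ranks highest
        current = sorted(current, key=score, reverse=True)[:keep]
        # partition the survivors more finely for the next round
        current = [sub for s in current for sub in split(s)]
    return sorted(current, key=lambda s: s.start)

# Toy usage: pretend frames 40-60 answer the question; the scorer measures
# what fraction of a shot overlaps that window (a stand-in for an MLLM).
def mock_score(shot: Shot) -> float:
    overlap = max(0, min(shot.end, 60) - max(shot.start, 40))
    return overlap / (shot.end - shot.start)

selected = chain_of_shot([Shot(0, 30), Shot(30, 70), Shot(70, 100)], mock_score)
print(selected)  # fine-grained shots covering frames 40-60
```

Each round discards irrelevant temporal context and zooms into the survivors, which mirrors the "progressive shot filtering and coarse-to-fine multimodal reasoning" the summary describes.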
Zikang Wang
Institute of Automation, Chinese Academy of Sciences
Boyu Chen
The University of Sydney
Neural Architecture Search · Transformer
Zhengrong Yue
Shanghai Jiao Tong University, PhD
Unified Multimodal Modeling · Video Understanding · Video Generation
Yi Wang
Shanghai Artificial Intelligence Laboratory
Yu Qiao
Shanghai Artificial Intelligence Laboratory
Limin Wang
Nanjing University, Shanghai Artificial Intelligence Laboratory
Yali Wang
Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shanghai Artificial Intelligence Laboratory