LensWalk: Agentic Video Understanding by Planning How You See in Videos

📅 2026-03-25
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses a limitation of existing video understanding methods, which struggle to dynamically acquire evidence due to a disconnect between reasoning and perception. The authors propose an agent-based framework that endows large language models, for the first time, with the ability to actively plan the temporal scope and sampling density of video observations. By establishing a closed-loop "reason–plan–observe" process, the model performs on-demand, progressive evidence collection. The framework integrates a multimodal toolset built upon vision-language models, enabling wide-range scanning, localized focus, and cross-temporal evidence fusion—all without requiring fine-tuning. Evaluated on long-video benchmarks such as LVBench and Video-MME, the approach achieves accuracy gains exceeding 5%, while also improving robustness and interpretability.

📝 Abstract
The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to actively control its own visual observation. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
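The reason-plan-observe loop described above can be sketched as a simple agent loop: at each step the reasoner either commits to an answer or emits a viewing plan (temporal scope plus sampling density), a VLM-based tool observes the requested segment, and the resulting evidence feeds the next reasoning step. This is a minimal illustration, not the paper's implementation; the `reasoner` and `vlm_tool` callables, the plan dictionary keys, and the `Observation` record are all hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    start_s: float      # segment start (seconds)
    end_s: float        # segment end (seconds)
    fps: float          # sampling density chosen by the planner
    description: str    # what the VLM tool reported about the sampled frames

def reason_plan_observe(question, video_len_s, reasoner, vlm_tool, max_steps=8):
    """Closed-loop evidence gathering: the reasoner plans WHERE to look and
    HOW DENSELY to sample; a VLM tool perceives; the loop repeats until the
    reasoner commits to an answer or the step budget runs out."""
    evidence = []
    for _ in range(max_steps):
        # PLAN: the reasoner returns either a final answer or a viewing plan
        plan = reasoner(question, evidence, video_len_s)
        if plan["action"] == "answer":
            return plan["answer"]
        # OBSERVE: sample the requested temporal scope at the requested density
        obs = vlm_tool(start_s=plan["start_s"], end_s=plan["end_s"],
                       fps=plan["fps"], query=plan["query"])
        # REASON: the new evidence conditions the next planning step
        evidence.append(obs)
    # Fallback: force an answer from whatever evidence was gathered
    return reasoner(question, evidence, video_len_s, force_answer=True)["answer"]
```

A broad low-fps scan followed by a dense pass over a short segment falls out naturally from this structure: the first plan can cover the whole video at low `fps`, and later plans can narrow `start_s`/`end_s` while raising `fps`, matching the broad-scan-then-focus behavior the abstract describes.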
Problem

Research questions and friction points this paper is trying to address.

video understanding
reasoning-perception disconnect
active visual observation
temporal video analysis
agentic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic video understanding
active perception
reason-plan-observe loop
dynamic video sampling
vision-language reasoning
Keliang Li
Institute of Computing Technology, Chinese Academy of Sciences, China
Yansong Li
College of Computer Science and Electronic Engineering, Hunan University, China
Hongze Shen
Institute of Computing Technology, Chinese Academy of Sciences, China; Peng Cheng Laboratory, China; University of Chinese Academy of Sciences, China
Mengdi Liu
Institute of Computing Technology, Chinese Academy of Sciences
Diffusion models · AI4Science
Hong Chang
Researcher at Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning · Computer Vision · Pattern Recognition
Shiguang Shan
Professor at Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision · Pattern Recognition · Machine Learning · Face Recognition