Act2See: Emergent Active Visual Perception for Video Reasoning

📅 2026-05-02

🤖 AI Summary

Current vision-language models for video reasoning rely on a static set of initial frames and cannot pull in additional visual information as the reasoning unfolds. This work proposes Act2See, a framework that lets the model decide, within its textual chain-of-thought, when to retrieve an existing video frame or generate a new one, so that visual perception and reasoning proceed together. The model is trained via supervised fine-tuning on reasoning traces generated by a frontier VLM and verified against human-annotated chains-of-thought; the traces contain explicit calls for frame retrieval and image generation. Experiments show that Act2See sets new state-of-the-art results on VideoEspresso and ViTIB, and outperforms models of comparable or larger scale on Video-MME, EgoNormia, and VCR-Bench.

📝 Abstract

Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.
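
The interleaving described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how calls are written inside the chain-of-thought, so the `<retrieve>` and `<generate>` tags, the timestamp keys, and the helper names `extract_calls`, `resolve`, and `interleave` are all hypothetical; this is not the authors' implementation, only a toy showing how retrieve-or-generate calls embedded in text might be dispatched.

```python
import re
from dataclasses import dataclass

# Hypothetical call syntax: the source does not state the real format.
CALL = re.compile(r"<(retrieve|generate)>(.*?)</\1>")


@dataclass
class VisualCall:
    kind: str      # "retrieve" (existing frame) or "generate" (synthesized image)
    argument: str  # timestamp key for retrieval, text prompt for generation


def extract_calls(cot_text):
    """Return the visual calls found in a chain-of-thought string, in order."""
    return [VisualCall(m.group(1), m.group(2).strip())
            for m in CALL.finditer(cot_text)]


def resolve(call, frames):
    """Stub dispatcher: look up a frame, or return a placeholder for a generated image."""
    if call.kind == "retrieve":
        return frames.get(call.argument, "<missing frame>")
    # A real system would invoke an image generator here; this is a stand-in.
    return f"<generated image: {call.argument}>"


def interleave(cot_text, frames):
    """Replace each call in the text with its resolved visual placeholder."""
    def substitute(match):
        call = VisualCall(match.group(1), match.group(2).strip())
        return resolve(call, frames)
    return CALL.sub(substitute, cot_text)


frames = {"00:12": "[frame@00:12]"}  # toy frame store keyed by timestamp
cot = ("The cup is full at first. <retrieve>00:12</retrieve> "
       "If it tipped, <generate>cup tipped over</generate> water would spill.")
print(interleave(cot, frames))
```

In this toy, retrieval covers evidence that exists in the video, while generation stands in for the hypothetical or counterfactual scenes the abstract mentions.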

Problem

Research questions and friction points this paper is trying to address.

video reasoning

vision-language models

active visual perception

Chain-of-Thought

dynamic visual information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Visual Perception

Vision-Language Models

Chain-of-Thought