PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments

πŸ“… 2025-10-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current multimodal large language models (MLLMs) have been developed and evaluated largely in static, fully observable settings, which limits their visual reasoning in real-world environments that are partially observable and interactive. To address this, we propose the Active Visual Reasoning (AVR) paradigm, a framework requiring embodied agents to actively acquire information via physical interactions (e.g., navigation, manipulation), thereby closing the perception-reasoning-action loop. We introduce CLEVR-AVR, the first benchmark for AVR, and AVR-152k, a large-scale dataset comprising diverse interactive visual reasoning scenarios. Furthermore, we design PhysVLM-AVR, a model trained with chain-of-thought annotations, action-conditioned information gain prediction, and a higher-order Markov decision process formulation. Extensive experiments demonstrate that PhysVLM-AVR achieves state-of-the-art performance on both active and passive visual reasoning tasks. Our analysis further uncovers a fundamental bottleneck in contemporary embodied agents: the inability to dynamically integrate evolving sensory information over time.
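
To make the closed perception-reasoning-action loop concrete, here is a minimal sketch of an AVR-style interaction loop. All identifiers (`ActiveReasoner`, `predict_info_gain`, `env.available_actions`, `env.step`) are illustrative assumptions, not the paper's actual API; the stubs stand in for learned MLLM components.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ActiveReasoner:
    """Toy agent closing the perception-reasoning-action loop."""
    max_steps: int = 8
    # Higher-order MDP: the agent conditions on past observations and actions.
    history: list = field(default_factory=list)

    def answer_or_none(self, question):
        # Placeholder: a real MLLM would reason over self.history and
        # return an answer once the accumulated evidence suffices.
        return None

    def predict_info_gain(self, action, question):
        # Placeholder: the paper trains action-conditioned information
        # gain prediction; here it is a random stub.
        return random.random()

    def run(self, env, question):
        self.history.append(("obs", env.reset()))
        for _ in range(self.max_steps):
            answer = self.answer_or_none(question)
            if answer is not None:
                return answer  # evidence sufficient: stop interacting
            # Pick the physical action with highest predicted information gain.
            action = max(env.available_actions(),
                         key=lambda a: self.predict_info_gain(a, question))
            self.history += [("act", action), ("obs", env.step(action))]
        return self.answer_or_none(question)  # best effort once budget is spent
```

The `env` object is assumed to expose `reset()`, `available_actions()`, and `step(action)` (e.g., navigate, rotate, or manipulate an object), mirroring a standard embodied-simulation interface.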

πŸ“ Abstract
Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment (moving, examining, and manipulating objects) to gather information through a closed-loop process integrating perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. AVR requires agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously evaluate AVR, we introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive environments designed to assess both reasoning correctness and information-gathering efficiency. We present AVR-152k, a large-scale dataset offering rich Chain-of-Thought (CoT) annotations that detail iterative reasoning for uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection, which are crucial for training agents in a higher-order Markov Decision Process. Building on this, we develop PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K). Our analysis also reveals that current embodied MLLMs, despite detecting information incompleteness, struggle to actively acquire and integrate new information through interaction, highlighting a fundamental gap in active reasoning capabilities.
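
As a worked formulation of the information-maximizing action selection the abstract describes, one standard way to define information gain is the expected reduction in uncertainty over the answer; the notation below (history h_t, answer y, gain IG) is ours, not the paper's:

```latex
% Belief over the answer y given interaction history h_t = (o_1, a_1, \dots, o_t);
% a higher-order MDP policy conditions on h_t rather than on o_t alone.
\begin{aligned}
\mathrm{IG}(a \mid h_t) &= H\!\bigl(p(y \mid h_t)\bigr)
  - \mathbb{E}_{o' \sim p(o' \mid h_t,\, a)}\!\Bigl[ H\!\bigl(p(y \mid h_t, a, o')\bigr) \Bigr], \\
a_t^{*} &= \arg\max_{a \in \mathcal{A}} \mathrm{IG}(a \mid h_t).
\end{aligned}
```

Under this reading, the agent stops acting once the entropy of its answer belief falls below a threshold, which is what the benchmark's efficiency metric would reward.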
Problem

Research questions and friction points this paper is trying to address.

Extends visual reasoning to partially observable interactive environments
Enables active information gathering through sequential physical actions
Integrates multi-step observations for dynamic reasoning with visual feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active visual reasoning in partially observable environments
Sequential physical actions for dynamic information acquisition
Chain-of-Thought annotations for iterative reasoning training
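
For illustration, a CoT training record in the AVR-152k style might look like the following. This schema is a guess assembled from the three annotation components the abstract lists (uncertainty identification, action-conditioned information gain prediction, and action selection), not the dataset's actual format:

```python
# Hypothetical AVR-152k-style record (assumed schema, for illustration only).
example_record = {
    "question": "Is there a metal cube behind the red cylinder?",
    "steps": [
        {
            "observation": "frame_000.png",
            "uncertainty": "The region behind the red cylinder is occluded.",
            "candidate_actions": {          # predicted information gain
                "move_left": 0.7,           # (assumed 0-1 scale)
                "move_right": 0.2,
                "pick_up_cylinder": 0.9,
            },
            "selected_action": "pick_up_cylinder",
        },
        {
            "observation": "frame_001.png",
            "uncertainty": "None; the occluded region is now visible.",
            "candidate_actions": {},
            "selected_action": "answer",
        },
    ],
    "answer": "yes",
}
```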
πŸ”Ž Similar Papers
No similar papers found.
Authors

Weijie Zhou (Beijing Jiaotong University)
Xuantang Xiong (Tencent Robotics X & Futian Laboratory, Shenzhen)
Yi Peng (Bytedance)
Manli Tao (Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences)
Chaoyang Zhao (Institute of Automation, Chinese Academy of Sciences)
Honghui Dong (Beijing Jiaotong University)
Ming Tang (Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences)
Jinqiao Wang (Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences)