🤖 AI Summary
This work addresses the lack of systematic evaluation of active perception in multimodal large language models (MLLMs). We propose ActiView, a dedicated benchmark that reformulates visual question answering (VQA) as a goal-directed, closed perception–reasoning loop under a restricted field of view, using zooming and shifting operations to explicitly model and quantify a model's ability to adjust its perceptual field based on reasoning. The key contributions are threefold: (1) a formal definition and empirical evaluation of active perception in MLLMs; (2) restricted-field VQA, which treats perception actions as evaluable intermediate reasoning steps; and (3) a cross-model evaluation covering over 30 proprietary and open-source MLLMs. Experiments reveal substantial deficiencies in current models' active perception abilities. ActiView thus provides a reproducible benchmark and diagnostic tool for advancing this critical yet underexplored direction.
📝 Abstract
Active perception, a crucial human capability, involves setting a goal based on the current understanding of the environment and performing actions to achieve that goal. Despite significant efforts in evaluating Multimodal Large Language Models (MLLMs), active perception has been largely overlooked. To address this gap, we propose a novel benchmark named ActiView to evaluate active perception in MLLMs. We focus on a specialized form of Visual Question Answering (VQA) that eases and quantifies evaluation yet remains challenging for existing MLLMs. We also examine the intermediate reasoning behaviors of models. Given an image, we restrict the perceptual field of a model, requiring it to actively zoom or shift its perceptual field based on reasoning to answer the question successfully. We conduct an extensive evaluation of over 30 models, including proprietary and open-source models, and observe that restricted perceptual fields play a significant role in enabling active perception. Results reveal a significant gap in the active perception capability of MLLMs, indicating that this area deserves more attention. We hope that ActiView can help develop methods for MLLMs to understand multimodal inputs in more natural and holistic ways.
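To make the setup concrete, here is a minimal sketch of the kind of closed perception–reasoning loop the abstract describes: the model sees only a crop of the full image and must issue zoom or shift actions until it commits to an answer. This is an illustrative assumption, not ActiView's actual evaluation harness; the `model.step` interface, the action vocabulary, and the window sizes are hypothetical.

```python
from dataclasses import dataclass
from PIL import Image


@dataclass
class View:
    """A restricted perceptual field: a square crop window over the full image."""
    left: int
    top: int
    size: int  # window side length, in pixels

    def crop(self, image: Image.Image) -> Image.Image:
        return image.crop((self.left, self.top, self.left + self.size, self.top + self.size))


def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(value, high))


def apply_action(view: View, action: str, image: Image.Image, step: int = 128) -> View:
    """Apply a zoom or shift action, keeping the window inside the image bounds."""
    w, h = image.size
    left, top, size = view.left, view.top, view.size
    if action == "zoom_in":
        size = max(size // 2, 64)
    elif action == "zoom_out":
        size = min(size * 2, min(w, h))
    elif action == "shift_left":
        left -= step
    elif action == "shift_right":
        left += step
    elif action == "shift_up":
        top -= step
    elif action == "shift_down":
        top += step
    # Unknown actions leave the view unchanged.
    left = clamp(left, 0, w - size)
    top = clamp(top, 0, h - size)
    return View(left, top, size)


def run_episode(model, image: Image.Image, question: str, max_steps: int = 8):
    """Closed perception-reasoning loop: at each step the model sees only the
    current crop and replies with either an action or a final answer."""
    w, h = image.size
    view = View(0, 0, min(w, h) // 2)  # initial restricted field
    for _ in range(max_steps):
        reply = model.step(question, view.crop(image))  # hypothetical model interface
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        view = apply_action(view, reply.strip(), image)
    return None  # model never committed to an answer
```

Under this framing, the sequence of views the model requests is itself an evaluable trace of its intermediate reasoning, which is what distinguishes this setup from standard full-image VQA.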