🤖 AI Summary
This work addresses the lack of systematic evaluation of active perception in multimodal large language models (MLLMs). We propose ActiView, a dedicated benchmark that reformulates visual question answering (VQA) as a goal-directed, closed perception–reasoning loop under a restricted field of view, using zooming and shifting operations to explicitly model and quantify a model's ability to adjust its perceptual field based on reasoning. The key contributions are threefold: (1) a formal definition and empirical evaluation of active perception in MLLMs; (2) restricted-field VQA, which treats perception actions as evaluable intermediate reasoning steps; and (3) a cross-model evaluation covering over 30 proprietary and open-source MLLMs. Experiments reveal substantial deficiencies in current models' active perception abilities. ActiView thus provides a reproducible benchmark and diagnostic tool for advancing this critical yet underexplored direction.
📝 Abstract
Active perception, a crucial human capability, involves setting a goal based on the current understanding of the environment and performing actions to achieve that goal. Despite significant efforts in evaluating Multimodal Large Language Models (MLLMs), active perception has been largely overlooked. To address this gap, we propose a novel benchmark named ActiView to evaluate active perception in MLLMs. We focus on a specialized form of Visual Question Answering (VQA) that eases and quantifies evaluation yet remains challenging for existing MLLMs. We also examine the intermediate reasoning behaviors of models. Given an image, we restrict the perceptual field of a model, requiring it to actively zoom or shift its perceptual field based on reasoning to answer the question successfully. We conduct an extensive evaluation of over 30 models, including proprietary and open-source models, and observe that restricted perceptual fields play a significant role in enabling active perception. Results reveal a significant gap in the active perception capability of MLLMs, indicating that this area deserves more attention. We hope that ActiView can help develop methods for MLLMs to understand multimodal inputs in more natural and holistic ways.
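To make the setup concrete, here is a minimal sketch of the kind of closed perception–reasoning loop the abstract describes: the model sees only a crop of the full image and must issue zoom or shift actions until it commits to an answer. This is an illustrative assumption, not ActiView's actual evaluation harness; the `model.step` interface, the action vocabulary, and the window sizes are hypothetical.

```python
from dataclasses import dataclass
from PIL import Image


@dataclass
class View:
    """A restricted perceptual field: a square crop window over the full image."""
    left: int
    top: int
    size: int  # window side length, in pixels

    def crop(self, image: Image.Image) -> Image.Image:
        return image.crop((self.left, self.top, self.left + self.size, self.top + self.size))


def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(value, high))


def apply_action(view: View, action: str, image: Image.Image, step: int = 128) -> View:
    """Apply a zoom or shift action, keeping the window inside the image bounds."""
    w, h = image.size
    left, top, size = view.left, view.top, view.size
    if action == "zoom_in":
        size = max(size // 2, 64)
    elif action == "zoom_out":
        size = min(size * 2, min(w, h))
    elif action == "shift_left":
        left -= step
    elif action == "shift_right":
        left += step
    elif action == "shift_up":
        top -= step
    elif action == "shift_down":
        top += step
    # Unknown actions leave the view unchanged.
    left = clamp(left, 0, w - size)
    top = clamp(top, 0, h - size)
    return View(left, top, size)


def run_episode(model, image: Image.Image, question: str, max_steps: int = 8):
    """Closed perception-reasoning loop: at each step the model sees only the
    current crop and replies with either an action or a final answer."""
    w, h = image.size
    view = View(0, 0, min(w, h) // 2)  # initial restricted field
    for _ in range(max_steps):
        reply = model.step(question, view.crop(image))  # hypothetical model interface
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        view = apply_action(view, reply.strip(), image)
    return None  # model never committed to an answer
```

Under this framing, the sequence of views the model requests is itself an evaluable trace of its intermediate reasoning, which is what distinguishes this setup from standard full-image VQA.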