ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of active perception capabilities in multimodal large language models (MLLMs). We propose ActiView, the first dedicated benchmark that reformulates visual question answering (VQA) as a goal-directed perception–reasoning closed loop under restricted field-of-view conditions, incorporating dynamic zooming and panning operations to explicitly model and quantify models’ ability to actively modulate perceptual behavior based on reasoning. Our key contributions are threefold: (1) the first formal definition and empirical evaluation of active perception in MLLMs; (2) the introduction of restricted-field VQA, treating perception actions as evaluable intermediate reasoning steps; and (3) a comprehensive, cross-model evaluation framework covering 30 state-of-the-art MLLMs. Experiments reveal substantial deficiencies in current models’ active perception abilities. ActiView provides a reproducible benchmark and diagnostic toolkit to advance research in this critical yet underexplored direction.

📝 Abstract
Active perception, a crucial human capability, involves setting a goal based on the current understanding of the environment and performing actions to achieve that goal. Despite significant efforts in evaluating Multimodal Large Language Models (MLLMs), active perception has been largely overlooked. To address this gap, we propose a novel benchmark named ActiView to evaluate active perception in MLLMs. We focus on a specialized form of Visual Question Answering (VQA) that eases and quantifies evaluation yet remains challenging for existing MLLMs; we also examine models' intermediate reasoning behaviors. Given an image, we restrict the perceptual field of a model, requiring it to actively zoom or shift its perceptual field based on reasoning to answer the question successfully. We conduct extensive evaluation over 30 models, including proprietary and open-source models, and observe that restricted perceptual fields play a significant role in enabling active perception. Results reveal a significant gap in the active perception capability of MLLMs, indicating that this area deserves more attention. We hope that ActiView can help develop methods for MLLMs to understand multimodal inputs in more natural and holistic ways.
Problem

Research questions and friction points this paper is trying to address.

Evaluating active perception in Multimodal Large Language Models
Assessing MLLMs' ability to adapt perceptual fields for VQA
Identifying gaps in MLLMs' active perception capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes ActiView benchmark for MLLM active perception
Uses restricted perceptual fields to evaluate reasoning
Evaluates 30 models, revealing gaps in active perception
Ziyue Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Chi Chen
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Fuwen Luo
Tsinghua University
Computer Science
Yurui Dong
Fudan University
NLP · Multimodal AI · LLM
Yuan Zhang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Yuzhuang Xu
Tsinghua University
Natural Language Processing · Efficient AI · Machine Learning
Xiaolong Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yang Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China