Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Multimodal large language models (MLLMs) lack active visual perception capabilities, which leads to inefficient searches for critical information and inaccurate region localization. Method: We propose ACTIVE-O3, a framework that formalizes active visual perception for MLLMs and models “where and how to look” as an end-to-end learnable visual policy. Built on GRPO, it uses a purely reinforcement-learning-based training paradigm with a structured action space comprising zoom, pan, and focus operations. We further introduce a comprehensive benchmark spanning open-world, remote sensing, and autonomous driving scenarios. Contribution/Results: Experiments demonstrate significant improvements over GPT-4o on small-object detection, dense object grounding, and fine-grained interactive segmentation. The model also exhibits strong zero-shot reasoning on the V* Benchmark without relying on explicit reasoning data. The work aims to provide a simple codebase and evaluation protocol for future research.

📝 Abstract

Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
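To make the zoom-in idea in the abstract concrete, here is a minimal sketch of how a proposed region could be turned into a pixel crop box. It is an illustration only, not the paper's implementation: the function name `zoom_region_to_pixels` and the normalized `(x0, y0, x1, y1)` region format are assumptions introduced here.

```python
def zoom_region_to_pixels(width, height, region):
    """Map a normalized (x0, y0, x1, y1) region to a clamped pixel box.

    Hypothetical helper: the region format is an assumption for illustration,
    not the action encoding used by ACTIVE-O3.
    """
    x0, y0, x1, y1 = region
    # Order the corners so the box is valid even if they arrive swapped.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))

    def clamp(v):
        # Keep coordinates inside the image, i.e. within [0, 1].
        return min(1.0, max(0.0, v))

    left, right = round(clamp(x0) * width), round(clamp(x1) * width)
    top, bottom = round(clamp(y0) * height), round(clamp(y1) * height)
    return left, top, right, bottom


# A region that overshoots the bottom edge is clamped to the image bounds.
box = zoom_region_to_pixels(1000, 500, (0.25, 0.5, 0.75, 1.2))
```

A crop box like this could then be passed to an image library to produce the zoomed-in view that the model inspects next; clamping keeps an imprecise region proposal from producing an invalid crop.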

Problem

Research questions and friction points this paper is trying to address.

Enabling MLLMs to learn active perception for better decision-making

Improving search efficiency and accuracy in GPT-o3's zoom-in strategy

Establishing benchmarks for evaluating active perception in diverse scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

GRPO-based reinforcement learning framework

Enhances MLLMs with active perception

Comprehensive benchmark for evaluation
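The GRPO-based framework listed above rests on group-relative advantages: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and standard deviation, so no learned value function is needed. The snippet below is a minimal sketch of that normalization step only. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; the reward design and policy update used by ACTIVE-O3 are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group, GRPO-style.

    Assumptions: population standard deviation and a small eps to avoid
    division by zero when every response in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive positive advantages and the rest receive negative ones; when all rewards in a group are identical, every advantage is zero and that group contributes no learning signal.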

🔎 Similar Papers

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models