AI Summary
This work investigates the efficacy and limitations of reinforcement learning (RL) for post-training visual perception in multimodal large language models (MLLMs). Addressing the task-specific characteristics of perception, we propose the first task-adaptive RL framework driven by Group Relative Policy Optimization (GRPO), revealing perceptual complexity as the key determinant of RL gains. Built on the Qwen2.5-VL-3B-Instruct architecture, our method employs multi-stage RL with scalable reward modeling, yielding improvements of 4.2%, 17.9%, and 4.2% on RefCOCO+, PixMo-Count, and PageOCR, respectively. Notably, it achieves a new state-of-the-art 31.9% AP on COCO2017 val, the first such result for MLLM-based perception. Our core contribution is the establishment of the first RL-based post-training paradigm explicitly grounded in perceptual complexity, providing a reproducible path toward raising the upper bound of MLLM visual perception capabilities.
Abstract
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently yield performance gains across all visual perception tasks. This leads us to examine the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and study the effects of RL on different perception tasks. We observe that perceptual complexity is a major factor in determining the effectiveness of RL, and that reward design plays a crucial role in further approaching the upper limit of model perception. Leveraging these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
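For readers unfamiliar with GRPO, the abstract's mention of it can be made concrete with a minimal sketch of its central idea: for each prompt, a group of responses is sampled and scored by a rule-based reward, and each response's advantage is computed relative to its own group's statistics, with no learned value critic. The function name and group size below are illustrative, not from the paper.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Assumption: rewards come from a rule-based scorer (e.g., IoU for
# grounding, exact match for counting/OCR), one group per prompt.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std.

    Responses scoring above the group average get positive
    advantages (reinforced); below-average ones get negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rule-based rewards for 4 sampled responses to one prompt.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because advantages are centered within each group, they sum to (approximately) zero per prompt; only the relative quality of a response within its group drives the policy update.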