Perception-R1: Pioneering Perception Policy with Reinforcement Learning

πŸ“… 2025-04-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work investigates the efficacy and limitations of reinforcement learning (RL) for post-training visual perception policies in multimodal large language models (MLLMs). Addressing the task-specific characteristics of perception, we propose the first task-adaptive RL framework driven by Group Relative Policy Optimization (GRPO), revealing perceptual complexity as the key determinant of RL gains. Built upon the Qwen2.5-VL-3B-Instruct architecture, our method employs multi-stage RL with scalable reward modeling, yielding improvements of 4.2%, 17.9%, and 4.2% on RefCOCO+, PixMo-Count, and PageOCR, respectively. Notably, it achieves a new state-of-the-art 31.9% AP on COCO2017 val, the first such result for MLLM-based perception. Our core contribution is the establishment of the first RL-based post-training paradigm explicitly grounded in perceptual complexity, providing a reproducible pathway to advance the upper bound of MLLM visual perception capabilities.
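The summary names GRPO as the optimizer. The paper itself does not ship code on this page, but the defining step of GRPO, replacing a learned value critic with group-relative reward normalization over a batch of sampled rollouts, can be sketched as follows (a minimal illustration, not the authors' implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style training:
    each rollout's reward is normalized by the mean and standard
    deviation of its own sampled group, so no value critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Rollouts scoring above their group's mean receive positive advantages and are reinforced; below-mean rollouts are suppressed, which is what makes simple rule-based rewards usable without a critic.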

πŸ“ Abstract
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
Problem

Research questions and friction points this paper is trying to address.

Exploring RL's role in visual perception tasks
Investigating perceptual complexity impact on RL effectiveness
Developing scalable RL framework for perception policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rule-based reinforcement learning for perception policy
Scalable RL framework using GRPO in post-training
Reward design crucial for model perception limits
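For grounding-style tasks such as RefCOCO+, a rule-based reward can be computed directly from geometry rather than from a learned reward model. The sketch below shows one plausible form of such a reward, an IoU threshold on the predicted bounding box; the threshold value and the binary shaping are illustrative assumptions, not details taken from the paper:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_reward(pred_box, gt_box, thresh=0.5):
    """Rule-based reward (hypothetical): 1 if the predicted box
    overlaps the ground truth above the IoU threshold, else 0."""
    return 1.0 if iou(pred_box, gt_box) >= thresh else 0.0
```

Because rewards like this are cheap, deterministic, and verifiable, they scale to the large rollout counts that GRPO-style training requires.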
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
En Yu
Huazhong University of Science and Technology
Kangheng Lin
Beijing University of Posts and Telecommunications
Liang Zhao
StepFun
Jisheng Yin
StepFun
Yana Wei
Johns Hopkins University
Yuang Peng
Tsinghua University
Generative Model, Multimodal Learning
Haoran Wei
StepFun
Jianjian Sun
Researcher, StepFun
LLM, Multi-modal
Chunrui Han
StepFun
Zheng Ge
Senior Researcher, StepFun
Multimodal Models, Perception and Reasoning
Xiangyu Zhang
StepFun
Daxin Jiang
Co-Founder & CEO, StepFun Corporation
Deep Learning, Foundation Models
Jingyu Wang
Beijing University of Posts and Telecommunications
Wenbing Tao
Professor, School of Automation, Huazhong University of Science and Technology
Image Processing, Computer Vision, Pattern Recognition