AI Summary
This work investigates the efficacy and limitations of reinforcement learning (RL) for post-training visual perception in multimodal large language models (MLLMs). Addressing the task-specific characteristics of perception, we propose the first task-adaptive RL framework driven by Group Relative Policy Optimization (GRPO), revealing perceptual complexity as the key determinant of RL gains. Built on the Qwen2.5-VL-3B-Instruct architecture, our method employs multi-stage RL with scalable reward modeling, yielding improvements of 4.2%, 17.9%, and 4.2% on RefCOCO+, PixMo-Count, and PageOCR, respectively. Notably, it achieves a new state-of-the-art 31.9% AP on COCO2017 val, the first such result for MLLM-based perception. Our core contribution is the establishment of the first RL-based post-training paradigm explicitly grounded in perceptual complexity, providing a reproducible path toward raising the upper bound of MLLM visual perception capabilities.
Abstract
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently yield performance gains across all visual perception tasks. This leads us to examine the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and study the effects of RL on different perception tasks. We observe that perceptual complexity is a major factor in determining the effectiveness of RL, and that reward design plays a crucial role in further approaching the upper limit of model perception. Leveraging these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
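For readers unfamiliar with GRPO, the abstract's mention of it can be made concrete with a minimal sketch of its central idea: for each prompt, a group of responses is sampled and scored by a rule-based reward, and each response's advantage is computed relative to its own group's statistics, with no learned value critic. The function name and group size below are illustrative, not from the paper.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Assumption: rewards come from a rule-based scorer (e.g., IoU for
# grounding, exact match for counting/OCR), one group per prompt.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std.

    Responses scoring above the group average get positive
    advantages (reinforced); below-average ones get negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rule-based rewards for 4 sampled responses to one prompt.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because advantages are centered within each group, they sum to (approximately) zero per prompt; only the relative quality of a response within its group drives the policy update.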