🤖 AI Summary
Multimodal reasoning models trained with reinforcement learning frequently misperceive their visual inputs, and these perception errors cap overall reasoning performance. To address this, the authors propose PAPO (Perception-Aware Policy Optimization), a framework that jointly optimizes perception and reasoning. Its core contribution is integrating perceptual awareness directly into the verifiable-reward RL objective: an Implicit Perception Loss, formulated as a KL divergence term added to GRPO, together with a Double Entropy Loss that mitigates a loss-hacking failure mode. Both are optimized end to end without auxiliary data, external reward models, or dedicated perception models, enabling a tight coupling between perception and reasoning. On standard multimodal benchmarks, PAPO achieves an average improvement of 4.4%; on highly vision-dependent tasks, gains approach 8.0%. Crucially, perception errors drop by 30.5%, indicating substantially stronger visual grounding in reasoning.
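For orientation, one plausible way to write the combined objective described above is shown below. The masked-image formulation and the coefficients $\gamma$ and $\eta$ are assumptions for illustration, not details stated in this summary:

$$
\mathcal{L}_{\text{PAPO}}(\theta) \;=\; \mathcal{L}_{\text{GRPO}}(\theta)
\;-\; \gamma\, \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid q, I) \,\big\|\, \pi_\theta(\cdot \mid q, I_{\text{mask}}) \big) \right]
\;+\; \eta \left( \mathcal{H}\big[\pi_\theta(\cdot \mid q, I)\big] + \mathcal{H}\big[\pi_\theta(\cdot \mid q, I_{\text{mask}})\big] \right)
$$

Here $q$ is the query, $I$ the image, and $I_{\text{mask}}$ an assumed corrupted copy of it. Maximizing the KL term rewards outputs that actually depend on the visual input, while penalizing the entropy of both branches discourages gaming the KL term by simply inflating uncertainty.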
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimization remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term added to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct a comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
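As a reading aid, here is a minimal PyTorch-style sketch of the two auxiliary terms as described above. It assumes the Implicit Perception Loss compares the policy's token distributions given the original versus a masked image; the function names, the masking mechanism, and the combination coefficients are hypothetical, not taken from the paper text:

```python
import torch
import torch.nn.functional as F

def implicit_perception_kl(logits_full: torch.Tensor,
                           logits_masked: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL( p_full || p_masked ) between the policy's
    next-token distributions on the original vs. masked visual input.

    Both tensors have shape (batch, seq_len, vocab) and come from the
    same policy. Training would *maximize* this term (i.e., subtract it
    from the loss) to reward reliance on the actual image.
    """
    logp_full = F.log_softmax(logits_full, dim=-1)
    logp_masked = F.log_softmax(logits_masked, dim=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v)), summed over vocab
    kl = (logp_full.exp() * (logp_full - logp_masked)).sum(dim=-1)
    return kl.mean()

def double_entropy(logits_full: torch.Tensor,
                   logits_masked: torch.Tensor) -> torch.Tensor:
    """Summed entropy of both branches. Penalizing this discourages the
    policy from 'hacking' the KL term by inflating uncertainty on the
    masked branch instead of genuinely attending to the image.
    """
    def entropy(logits: torch.Tensor) -> torch.Tensor:
        logp = F.log_softmax(logits, dim=-1)
        return -(logp.exp() * logp).sum(dim=-1).mean()
    return entropy(logits_full) + entropy(logits_masked)

# Hypothetical combination (gamma, eta are assumed coefficients):
# loss = grpo_loss \
#        - gamma * implicit_perception_kl(logits_full, logits_masked) \
#        + eta * double_entropy(logits_full, logits_masked)
```

This is a sketch of the general technique, not the authors' implementation; details such as token masking, normalization, and where the terms enter the GRPO update should be taken from the paper and code at the project page.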