Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of ambiguous credit assignment between perception and reasoning in existing reinforcement learning approaches for multimodal reasoning, which stems from shared reward mechanisms and limits the precision of visual evidence extraction. To overcome this limitation, the paper introduces PRCO, a dual-role co-evolutionary framework that decouples perception and reasoning through role-specific reward signals: one reward optimizes evidence description generation, the other answer prediction, and a joint optimization strategy integrates the utility- and outcome-based rewards. Evaluated across eight mainstream multimodal reasoning benchmarks, PRCO improves average accuracy by over 7 percentage points over the base model, substantially outperforming current open-source RL fine-tuning methods and demonstrating consistent gains across model scales.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
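The abstract's role-specific reward scheme can be illustrated with a toy sketch. This is a hypothetical reading, not the paper's implementation: the function names, exact-match check, and averaging over rollouts are all assumptions. The key idea it shows is the decoupled credit assignment, where the Solver is scored by a verifiable outcome reward on its final answer, and the Observer's utility reward is derived from the downstream success of Solver rollouts conditioned on its caption.

```python
# Hypothetical sketch of PRCO-style role-specific rewards.
# Names and reward definitions are assumptions, not from the paper.

def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """Verifiable outcome reward for the Solver: 1.0 iff the final
    answer matches the gold answer (exact match as a stand-in for a
    general verifier)."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

def utility_reward(solver_rewards: list[float]) -> float:
    """Utility reward for the Observer: mean Solver success across
    rollouts that used this Observer's evidence caption, so credit
    flows back to perception only through downstream reasoning."""
    if not solver_rewards:
        return 0.0
    return sum(solver_rewards) / len(solver_rewards)

# Toy rollout: one evidence caption, several Solver attempts on it.
gold = "42"
solver_answers = ["42", "17", "42"]
solver_rs = [outcome_reward(a, gold) for a in solver_answers]
observer_r = utility_reward(solver_rs)
```

Under this reading, a caption that reliably lets the Solver answer correctly earns the Observer a high utility reward, while a misleading caption is penalized even if the Observer's text is fluent, which is the decoupling the paper credits for fixing the perception bottleneck.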
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
perception bottleneck
credit assignment
reinforcement learning
visual evidence extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perception-Reasoning Coevolution
Multimodal Reasoning
Reinforcement Learning with Verifiable Rewards
Role-specific Reward
Credit Assignment