Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning

๐Ÿ“… 2025-11-23
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Existing Reinforcement Learning with Verifiable Rewards (RLVR) methods for vision-language models (VLMs) evaluate only the final textual output, neglecting verification during the visual perception stage โ€” leading to visual hallucinations and reward hacking. To address this, we propose PEARL, a dual-branch collaborative framework that introduces, for the first time, a **perceptual evidence anchoring mechanism**: it constructs verifiable perceptual checkpoints via a curated checklist of visual sub-questions and employs perceptual rewards as fidelity gates to jointly optimize perception and reasoning. Built upon RL frameworks such as GRPO and DAPO, PEARL leverages auxiliary rollouts to generate perceptual rewards, enabling multi-step perceptual validation. On benchmarks including MathVerse, PEARL achieves a 9.7% absolute improvement over the standard baseline and a 6.6% gain over GRPO, significantly enhancing the reliability and accuracy of multimodal reasoning.

๐Ÿ“ Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) and is now being applied to Vision-Language Models (VLMs). However, vanilla RLVR for VLMs verifies only the final textual output, critically neglecting the foundational step of visual perception. This oversight leads to visual hallucinations and reward hacking, as reasoning built upon flawed perception is inherently unreliable. To address this, we propose PEARL (Perceptual-Evidence Anchored Reinforced Learning), a dual-branch, perception-reasoning synergistic framework that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. For each reasoning-oriented QA instance, PEARL first derives a perception checklist -- a set of perception-oriented sub-questions with verifiable answers that probe the model's understanding of key visual evidence. During training, auxiliary rollouts on this checklist yield a perceptual reward that both directly reinforces the model's perception ability and acts as a fidelity gate for reasoning. If the model passes the perception check, its policy update is biased towards evidence-anchored reasoning. Otherwise, the update is halted to prevent reasoning from proceeding on flawed premises. PEARL can be seamlessly integrated with popular RL methods like GRPO and DAPO. Comprehensive experiments show PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.
Problem

Research questions and friction points this paper is trying to address.

Addresses visual hallucinations in multimodal reasoning models
Anchors reasoning to verified visual evidence to prevent errors
Enhances perception-reasoning synergy through dual-branch reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchors multimodal reasoning to verified visual evidence
Uses perception checklist for verifiable visual understanding
Integrates perceptual reward as fidelity gate for reasoning
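The fidelity-gate idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the averaging scheme for the perceptual reward, and the gate threshold are all assumptions for exposition.

```python
# Hypothetical sketch of PEARL's perceptual fidelity gate.
# Assumptions (not from the paper's code): exact-match scoring of checklist
# answers, a fixed gate threshold, and multiplicative advantage scaling.

def perceptual_reward(checklist_answers, reference_answers):
    """Fraction of perception sub-questions the model answered correctly,
    obtained from auxiliary rollouts on the perception checklist."""
    correct = sum(a == r for a, r in zip(checklist_answers, reference_answers))
    return correct / len(reference_answers)

def gated_advantage(reasoning_advantage, p_reward, gate_threshold=0.5):
    """Fidelity gate: propagate the reasoning advantage only when the model
    passes the perception check; otherwise halt the reasoning update so the
    policy is not reinforced on flawed visual premises."""
    if p_reward >= gate_threshold:
        # Bias the update toward evidence-anchored reasoning.
        return reasoning_advantage * p_reward
    return 0.0  # Perception check failed: no reasoning update.
```

For example, a rollout that answers half the checklist correctly sits at the gate boundary and still contributes a (scaled) update, while one that fails most perception sub-questions contributes nothing, regardless of its final-answer reward.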
Authors
Chi Zhang (School of Computer Science, Wuhan University)
Haibo Qiu (The University of Sydney)
Qiming Zhang (The University of Sydney)
Yufei Xu (The University of Sydney)
Zhixiong Zeng (Meituan Inc)
Siqi Yang (University of Electronic Science and Technology of China)
Peng Shi (Meituan Inc)
Lin Ma (Meituan Inc)
Jing Zhang (School of Computer Science, Wuhan University)