🤖 AI Summary
Vision-language models (VLMs) trained via reinforcement learning from verifiable rewards (RLVR) suffer from two critical deficiencies, visual extraction distortion (i.e., missing or hallucinated details) and chain-of-thought (CoT) logical inconsistency, stemming from the inability of terminal rewards to supervise intermediate perceptual and reasoning processes.
Method: We propose PeRL-VL, a decoupled perception-reasoning joint optimization framework. It introduces a novel *vision description reward* to explicitly supervise the fidelity and completeness of extracted image details, and separates textual reasoning fine-tuning into a dedicated *Reasoning SFT* stage, trained on logic-intensive tasks, to enhance CoT consistency. The method builds on RLVR with VLM self-description generation and evaluation, and is contrasted against multimodal distillation.
Results: On multimodal benchmarks, PeRL-VL achieves 68.8% Pass@1 accuracy (+5.5 points over the baseline), significantly outperforming standard RLVR, text-only reasoning fine-tuning, and GPT-4o distillation, marking the first approach to enable *coordinated, controllable optimization* of both perception and reasoning capabilities.
📄 Abstract
Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.
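The description reward scores a self-generated image description along two axes: faithfulness (no hallucinated details) and sufficiency (no missing details). The paper uses a VLM judge to produce these scores; the sketch below is a hypothetical toy proxy in which set overlap against a reference detail list stands in for the judge, purely to illustrate how the two axes combine into one scalar reward.

```python
def description_reward(described: set[str], reference: set[str],
                       w_faith: float = 0.5) -> float:
    """Toy proxy for a faithfulness/sufficiency description reward.

    Hypothetical illustration: PeRL-VL uses a VLM judge; here set overlap
    between the model's described details and a reference detail list
    stands in for judge scores.
    """
    if not described or not reference:
        return 0.0
    overlap = described & reference
    faithfulness = len(overlap) / len(described)  # penalizes hallucinated details
    sufficiency = len(overlap) / len(reference)   # penalizes missing details
    return w_faith * faithfulness + (1 - w_faith) * sufficiency

# Example: the model names 3 details, 1 hallucinated ("sunset"),
# while the image contains 4 reference details.
r = description_reward({"red car", "two people", "sunset"},
                       {"red car", "two people", "dog", "street sign"})
# faithfulness = 2/3, sufficiency = 2/4, so r = 0.5*(2/3) + 0.5*0.5
```

Weighting faithfulness and sufficiency separately makes the trade-off explicit: a reward based only on sufficiency would encourage verbose, hallucination-prone descriptions, while one based only on faithfulness would reward terse descriptions that omit detail.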