Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses error attribution in vision-language models (VLMs), which often show a "seesaw effect": gains in perception come at the expense of reasoning, or vice versa, and the source of a given failure is hard to identify. The authors propose a reinforcement learning framework that decomposes generation into interleaved perception and reasoning steps and adds two verification mechanisms: Perception Verification (PV), which rewards perceptual fidelity independently of the reasoning outcome, and Structured Verbal Verification, which replaces LLM judging with structured algorithmic execution. Both feed a Modality-Aware Credit Assignment (MoCA) mechanism that routes rewards to the source of an error, either perception or reasoning. According to the authors, this lets a single VLM improve on both perception and reasoning across a wide range of multimodal tasks, mitigating the trade-off between the two.

📝 Abstract

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent work has pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or burdened by the significant compute and engineering cost of external agentic workflows. Worse, this heavy investment does not yield proportional gains and often produces a "seesaw effect" between perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), which uses a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error, either bad seeing or bad thinking, enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.
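
To make the credit-assignment idea concrete, here is a minimal sketch of how rewards might be routed to perception and reasoning steps. It is not the paper's implementation: the abstract does not specify reward values or routing rules, so the `Step` type, the `assign_credit` function, the +1 / -1 / 0 reward scheme, and the exact-match "blindfolded" check are all assumptions made for illustration. The blindfolded reasoner is modeled as any text-only function that sees the question and the perception text, but never the image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str  # "perception" or "reasoning" (hypothetical labels)
    text: str


def perception_verified(
    steps: List[Step],
    question: str,
    reference: str,
    blind_reasoner: Callable[[str, str], str],
) -> bool:
    """Assumed PV proxy: can a text-only reasoner reach the reference
    answer from the perception steps alone, without seeing the image?"""
    description = " ".join(s.text for s in steps if s.kind == "perception")
    return blind_reasoner(question, description) == reference


def assign_credit(
    steps: List[Step],
    final_answer: str,
    question: str,
    reference: str,
    blind_reasoner: Callable[[str, str], str],
) -> List[float]:
    """Return one reward per step (illustrative routing rule, not the paper's)."""
    seeing_ok = perception_verified(steps, question, reference, blind_reasoner)
    answer_ok = final_answer == reference
    rewards = []
    for step in steps:
        if step.kind == "perception":
            # Perception is rewarded independently of the final outcome.
            rewards.append(1.0 if seeing_ok else -1.0)
        elif answer_ok:
            rewards.append(1.0)
        elif seeing_ok:
            # Perception was sufficient, so the failure is "bad thinking".
            rewards.append(-1.0)
        else:
            # Failure attributed to "bad seeing"; reasoning is not penalized.
            rewards.append(0.0)
    return rewards


def toy_blind_reasoner(question: str, description: str) -> str:
    """Stand-in for a text-only model, used only to exercise the sketch."""
    return "3" if "three" in description else "unknown"
```

With the toy reasoner, a rollout whose perception step says "There are three apples." but whose final answer is "4" (reference "3") yields rewards `[1.0, -1.0]`: the perception step is credited and the reasoning step is penalized. If the perception step instead says "There are two apples." and the answer is "2", the rewards are `[-1.0, 0.0]`, placing the blame on perception only.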

Problem

Research questions and friction points this paper is trying to address.

modality credit assignment

perception-reasoning synergy

vision-language reasoning

perception fidelity

bad seeing vs bad thinking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Perception Verification

Modality-Aware Credit Assignment

Vision-Language Reasoning