From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning

📅 2025-11-28
🤖 AI Summary
Current vision-language reasoning frameworks treat visual operations as optional auxiliary tools, so reasoning drifts away from the image evidence and induces hallucinations, an effect the authors call the "illusion of thinking with images." To address this, the paper proposes **Visual Rationale Learning (ViRL)**, a paradigm that models visual actions as fundamental reasoning units rather than optional tools. The approach uses end-to-end reinforcement learning with process supervision from ground-truth rationales, goal-aligned step-level reward shaping, and fine-grained credit assignment, ensuring models reach correct answers *for the right visual reasons*. Evaluated on benchmarks spanning perception, reasoning, and hallucination, ViRL achieves state-of-the-art performance while substantially improving reasoning transparency, verifiability, and trustworthiness.

📝 Abstract
Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. By ensuring each action contributes meaningfully to the reasoning chain, ViRL enables models to "get the right answer for the right visual reason". Trained purely with end-to-end RL, ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.
Problem

Research questions and friction points this paper is trying to address.

Addresses illusion of visual reasoning in vision-language models
Reframes visual actions as core reasoning primitives
Proposes end-to-end learning to ground reasoning in visual evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual rationalization as core reasoning primitives
End-to-end training with process supervision and reward shaping
Fine-grained credit assignment for meaningful visual actions
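The paper itself does not publish implementation details here, but the two training ideas above can be sketched together: each visual action in a trace gets a process-level reward by its label (correct, redundant, or erroneous), and the terminal outcome reward is discounted back to every step. All names, reward values, and the discount factor below are illustrative assumptions, not the authors' actual scheme.

```python
# Illustrative sketch only (NOT the paper's implementation): step-level
# reward shaping with fine-grained credit assignment over a visual action
# trace. The label set, reward magnitudes, and gamma are assumptions.

PROCESS_REWARDS = {"correct": 1.0, "redundant": -0.2, "erroneous": -1.0}

def shape_step_rewards(action_labels, final_answer_correct, gamma=0.9):
    """Combine per-action process rewards with a terminal outcome reward
    discounted back to each step of the trace."""
    terminal = 1.0 if final_answer_correct else -1.0
    n = len(action_labels)
    shaped = []
    for t, label in enumerate(action_labels):
        process = PROCESS_REWARDS[label]          # credit for the action itself
        outcome = terminal * gamma ** (n - 1 - t) # discounted terminal signal
        shaped.append(process + outcome)
    return shaped

# Example trace: one helpful crop, one redundant zoom, correct final answer.
print(shape_step_rewards(["correct", "redundant"], True))
```

Under this shaping, a redundant action in a correct trace still earns less credit than a helpful one, which is the intuition behind distinguishing correct, redundant, and erroneous actions.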
Changpeng Wang
Zhejiang University
Haozhe Wang
The Hong Kong University of Science and Technology
Xi Chen
The University of Hong Kong
Junhan Liu
Zhejiang University
Taofeng Xue
Meituan
Chong Peng
Qingdao University
Machine learning, computer vision
Donglian Qi
Zhejiang University
Power systems, control
Fangzhen Lin
Unknown affiliation
Yunfeng Yan
Zhejiang University