What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of fine-grained evaluation benchmarks for process reward models (PRMs) tailored to the “reasoning with visual grounding” paradigm, which hinders effective identification of diverse errors in visual reasoning. To bridge this gap, we construct the first PRM evaluation benchmark for this paradigm, comprising 1,206 human-annotated high-quality reasoning trajectories and defining seven fine-grained error categories. We systematically evaluate the capability of leading large vision-language models (LVLMs) to serve as PRMs in assessing reasoning processes. Our experiments reveal significant limitations: LVLM-based PRMs exhibit substantial performance variation across error types and are adversely affected by positivity bias and positional sensitivity, undermining their reliability in modeling process-level rewards.

📝 Abstract
Rapid advances in Large Vision-Language Models (LVLMs) have yielded excellent performance across a range of visual tasks. Building on these developments, the "thinking with images" paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges, as diverse errors may occur during the reasoning process. This calls for Process Reward Models (PRMs) that can distinguish positive from negative reasoning steps, yet existing PRM benchmarks are predominantly text-centric and offer no comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided-search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity of specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking-with-images reasoning trajectories spanning 4 categories and 16 subcategories, enabling fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs: they show limited capability in evaluating visual reasoning processes, with significant performance disparities across error types, a positive-evaluation bias, and sensitivity to the position of the reasoning step. These findings demonstrate the effectiveness of our benchmark and establish foundations for advancing PRMs in LVLMs.
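The abstract's PRM-guided search idea can be sketched in a few lines: a PRM assigns a score to each reasoning step, and a search procedure keeps the candidate trajectory whose steps score best. The function names, the min-step aggregation rule, and the toy scorer below are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def prm_guided_select(
    trajectories: List[List[str]],
    score_step: Callable[[List[str], int], float],
) -> List[str]:
    """Return the trajectory whose weakest step is scored highest by the PRM.

    Aggregating by the minimum step score is one common choice; the paper
    may use a different aggregation.
    """
    def trajectory_score(traj: List[str]) -> float:
        return min(score_step(traj, i) for i in range(len(traj)))
    return max(trajectories, key=trajectory_score)

# Toy stand-in for an LVLM-based PRM: it penalizes steps it flags as erroneous.
def toy_prm(traj: List[str], i: int) -> float:
    return 0.1 if "wrong crop" in traj[i] else 0.9

candidates = [
    ["crop region A", "read the sign", "answer: EXIT"],
    ["wrong crop of region B", "read the sign", "answer: ENTER"],
]
best = prm_guided_select(candidates, toy_prm)
print(best[-1])  # -> answer: EXIT
```

The same scaffolding extends to step-level search (pruning a partial trajectory as soon as one step scores poorly), which is where the paper's reported positivity bias and positional sensitivity of LVLM judges would matter most.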
Problem

Research questions and friction points this paper is trying to address.

Process Reward Models
Thinking with Images
Visual Reasoning
Error Types
Large Vision Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Models
Thinking with Images
Large Vision Language Models
Reasoning Trajectories
Visual Reasoning Evaluation
Yujin Zhou
Hong Kong University of Science and Technology
Pengcheng Wen
Hong Kong University of Science and Technology
Jiale Chen
Sun Yat-sen University
Boqin Yin
Hong Kong University of Science and Technology
Han Zhu
Hong Kong University of Science and Technology
Jiaming Ji
Peking University
J
Juntao Dai
Peking University
Chi-Min Chan
HKUST
Large Language Models, Post-Training, Alignment, LLM Agents
Sirui Han
The Hong Kong University of Science and Technology
Large Language Models, Interdisciplinary Artificial Intelligence