Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing vision-language process reward models struggle to distinguish reasoning errors from visual perception ambiguities, often rewarding hallucinated visual premises or penalizing valid image-grounded reasoning. To resolve this, the authors propose an Explicit Visual Premise Verification (EVPV) mechanism: the policy generates a step-wise visual checklist, which is matched against structured constraints independently extracted from the image, and the resulting reliability signal gates the calibration of vision-dependent reasoning steps. EVPV is the first approach to explicitly integrate visual premise verification into process reward modeling, disentangling perceptual uncertainty from logical evaluation without relying on external tools. Experiments demonstrate that EVPV significantly improves step-level verification accuracy and Best-of-N reranking performance on VisualProcessBench and six additional benchmarks, with controlled perturbation studies confirming that its gains stem from high-fidelity constraint matching.
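
The matching step above is described only at a high level. As a rough illustration, checklist-to-constraint matching could reduce to the sketch below, where the (entity, attribute, value) schema and every name (VisualConstraint, match_claim, visual_reliability) are assumptions made for exposition, not the paper's actual interface:

```python
from dataclasses import dataclass

@dataclass
class VisualConstraint:
    """One structured visual fact extracted from the image.
    The (entity, attribute, value) schema is an illustrative assumption."""
    entity: str     # e.g. "red triangle"
    attribute: str  # e.g. "count"
    value: str      # e.g. "3"

def match_claim(claim: dict, constraints: list[VisualConstraint]) -> float:
    """Score one checklist claim: 1.0 if supported by a constraint,
    0.0 if contradicted, 0.5 when no extracted constraint covers it."""
    for c in constraints:
        if c.entity == claim["entity"] and c.attribute == claim["attribute"]:
            return 1.0 if c.value == claim["value"] else 0.0
    return 0.5  # unverifiable: neither supported nor contradicted

def visual_reliability(checklist: list[dict],
                       constraints: list[VisualConstraint]) -> float:
    """Aggregate per-claim scores into one scalar reliability signal."""
    if not checklist:
        return 1.0  # the step asserts no visual facts
    return sum(match_claim(c, constraints) for c in checklist) / len(checklist)
```

Exact string equality is of course too brittle for real perceptual claims; the paper presumably matches with a model rather than literal comparison, but the aggregation shape, many per-step claims reduced to one scalar, is the point of the sketch.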

📝 Abstract
Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM
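
The abstract characterizes the gate only qualitatively: attenuate when reliability is low, preserve when high. A minimal sketch of one such reliability gate, reusing the visual_reliability signal sketched above and assuming a linear rule with a hypothetical floor hyperparameter, neither of which is taken from the paper:

```python
def gate_step_reward(step_reward: float,
                     reliability: float,
                     vision_dependent: bool,
                     floor: float = 0.1) -> float:
    """Reliability-gated calibration of one PRM step reward.

    Purely logical steps pass through unchanged; vision-dependent steps
    are scaled toward `floor` as reliability drops. Both the linear rule
    and the floor value are assumptions, not the paper's formulation.
    """
    if not vision_dependent:
        return step_reward
    gate = floor + (1.0 - floor) * reliability
    return step_reward * gate
```

Under Best-of-N, the gated step rewards would presumably be pooled per candidate (e.g. mean or min), so a chain resting on an unverified visual premise can no longer outrank a well-grounded one. The constraint-corruption probe mentioned at the end of the abstract could likewise be approximated by overwriting a random fraction of the extracted constraints and re-running reranking; a hypothetical sketch, again reusing the VisualConstraint type from above:

```python
import random

def corrupt_constraints(constraints: list[VisualConstraint],
                        rate: float,
                        seed: int = 0) -> list[VisualConstraint]:
    """Overwrite a fraction `rate` of constraint values with junk,
    mimicking the controlled-corruption study described above."""
    rng = random.Random(seed)
    return [
        VisualConstraint(c.entity, c.attribute, "CORRUPTED")
        if rng.random() < rate else c
        for c in constraints
    ]
```

Monotonically worse accuracy as `rate` rises is the causal signature the authors report.
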
Problem

Research questions and friction points this paper is trying to address.

vision-language process reward models
visual premise verification
false positives
false negatives
perceptual uncertainty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicit Visual Premise Verification
Vision-Language Process Reward Models
Reliability Gating
Visual Constraint Extraction
Step-wise Verification
Authors

Junxin Wang
Qwen Large Model Application Team, Alibaba

Dai Guan
Qwen Large Model Application Team, Alibaba

Weijie Qiu
Beijing University of Posts and Telecommunications

Zhihang Li
Kwai Inc
Computer Vision, Generative model, video/image generation, LLM

Yongbo Gai
Qwen Large Model Application Team, Alibaba

Zhengyi Yang
Chinese Academy of Sciences
Medical Image Processing, Haptic Modelling, Rapid Prototyping

Mengyu Zhou
Microsoft Research
Data analytics, Natural Language Processing, Network Science, Human Behaviors, Mobile & Ubiquitous Computing

Erchao Zhao
Qwen Large Model Application Team, Alibaba

Xiaoxi Jiang
Qwen Large Model Application Team, Alibaba

Guanjun Jiang
Qwen Large Model Application Team, Alibaba