ViLBench: A Suite for Vision-Language Process Reward Modeling

📅 2025-03-26
🤖 AI Summary
Vision large language models (VLLMs) perform inconsistently as output and process reward models (ORMs/PRMs), and strong general-purpose VLLM capability does not necessarily translate into effective reward modeling for vision-language reasoning. Method: We introduce ViLBench, a benchmark dedicated to vision-language process reward modeling whose tasks require intensive process-level reward signals; separately, we collect 73.6K high-quality vision-language process reward samples using an enhanced tree-search algorithm. Contribution/Results: A systematic evaluation of mainstream VLLMs shows only a weak correlation between their general-purpose capability and their reward modeling accuracy. Training a lightweight 3B reward model on the collected process supervision data yields an average improvement of 3.3% over standard chain-of-thought (CoT) and up to 2.5% over its untrained counterpart on ViLBench when selecting OpenAI o1's generations. Notably, even GPT-4o with CoT attains only 27.3% accuracy on ViLBench, underscoring the benchmark's difficulty for current VLLMs.

📝 Abstract
Process-supervised reward models serve as fine-grained functions that provide detailed step-wise feedback on model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, evaluation of PRMs remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and process reward models (PRMs), on multiple vision-language benchmarks; this evaluation reveals that neither ORM nor PRM consistently outperforms the other across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT and up to 2.5% over its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.
Problem

Research questions and friction points this paper is trying to address.

Evaluation of vision-language process reward models (PRMs) lacks dedicated benchmarks.
Current VLLMs show inconsistent performance as reward models.
A benchmark requiring intensive process reward signals is needed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarks VLLMs as ORMs and PRMs
Introduces ViLBench for process rewards
Uses tree-search for reward data collection
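The selection mechanism the paper relies on (a PRM scoring each reasoning step, then picking the best of N candidate generations) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `toy_scorer` function and min-over-steps aggregation are assumptions standing in for a trained reward model such as the paper's 3B PRM.

```python
def select_best_response(candidates, score_step):
    """Best-of-N selection: pick the candidate trajectory whose
    worst-scoring step is highest (min aggregation over step scores)."""
    def trajectory_score(steps):
        return min(score_step(s) for s in steps)
    return max(candidates, key=trajectory_score)

def toy_scorer(step):
    """Stand-in step scorer (assumption): penalize steps flagged as wrong.
    A real PRM would score each step conditioned on the image and question."""
    return 0.0 if "wrong" in step else 1.0

candidates = [
    ["compute area", "wrong multiplication", "answer: 12"],
    ["compute area", "multiply 3 by 4", "answer: 12"],
]
best = select_best_response(candidates, toy_scorer)
print(best)  # the trajectory with no flagged step is selected
```

Min aggregation is one common choice for turning step-level scores into a trajectory score (a single bad step sinks the whole chain); mean or product aggregation are alternatives with different tolerance for isolated errors.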