VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 28
Influential: 6
🤖 AI Summary
Existing evaluation methods for vision-language generative reward models (VL-GenRMs) rely on AI-annotated preference labels from traditional VL tasks, which introduce biases and fail to rigorously challenge state-of-the-art models. Method: We introduce VL-RewardBench, a purpose-built, high-difficulty benchmark of 1,250 AI-screened and human-verified samples covering general multimodal queries, visual hallucination detection, and complex reasoning. We evaluate 16 leading large vision-language models, complemented by Best-of-N assessment and Pearson correlation analysis against downstream benchmarks. Contribution/Results: Our analysis reveals three key insights: (1) foundational visual perception, not reasoning, is the primary bottleneck; (2) inference-time scaling benefits vary strongly with model capacity; and (3) the "learning to judge" training paradigm substantially improves judgment accuracy. Experiments show GPT-4o achieves only 65.4% accuracy, while state-of-the-art open-source models perform near chance level. VL-RewardBench scores correlate strongly with MMMU-Pro accuracy under Best-of-N sampling (Pearson's r > 0.9), and a 7B model improves judgment accuracy by 14.7% after "learning to judge" training.
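The core metric behind these accuracy figures is pairwise judgment accuracy: the judge picks between a chosen and a rejected response, so random guessing lands near 0.5. A minimal sketch (function and data names are illustrative, not from the paper):

```python
def judge_accuracy(judgments):
    """Fraction of pairwise comparisons where the judge's preferred
    response matches the human-verified preference.  With two
    candidates per query, random guessing scores ~0.5."""
    correct = sum(pred == gold for pred, gold in judgments)
    return correct / len(judgments)

# Toy example: 3 of 4 pairwise judgments agree with the gold preference.
example = [("A", "A"), ("B", "B"), ("A", "B"), ("B", "B")]
print(judge_accuracy(example))  # 0.75
```

On this metric, "near chance level" means the judge's choices carry almost no signal beyond a coin flip.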

📝 Abstract
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline that combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe VL-GenRMs' limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models, such as Qwen2-VL-72B, struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.
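The two quantitative tools the abstract leans on, Best-of-N selection with a reward model and the Pearson correlation between benchmark scores and downstream accuracy, can be sketched as follows. This is a toy illustration under assumed data; the reward function and the score lists are hypothetical, not values from the paper:

```python
import math

def best_of_n(candidates, reward_fn):
    """Best-of-N sampling: keep the candidate the reward model scores highest."""
    return max(candidates, key=reward_fn)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical numbers: per-model VL-RewardBench accuracy vs. the MMMU-Pro
# accuracy those models reach when used as Best-of-N selectors.
bench_scores = [0.42, 0.55, 0.61, 0.65]
bon_accuracy = [0.30, 0.41, 0.47, 0.52]
print(round(pearson_r(bench_scores, bon_accuracy), 3))
```

A high r on such paired scores is what licenses the claim that the benchmark predicts a reward model's usefulness for downstream Best-of-N selection.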
Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-language generative reward models effectively
Addressing biases in current AI-annotated preference labels
Challenging state-of-the-art models with complex multimodal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VL-RewardBench benchmark for VL-GenRMs
AI-assisted annotation with human verification
Training VL-GenRMs to judge boosts accuracy