🤖 AI Summary
Existing benchmarks for evaluating multimodal reward models (MRMs) in the video domain suffer from limited scale, narrow assessment dimensions, and insufficient coverage of MRM types. To address these limitations, we introduce VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. It comprises 1,563 annotated preference samples, each a triplet of a video-text prompt, a chosen response, and a rejected response, curated via an AI-assisted data pipeline. We evaluate 28 MRMs spanning three categories: generative, discriminative, and semi-scalar. Key findings include: (i) MRMs trained with reinforcement learning do not necessarily exhibit stronger cross-modal generalization than those trained without it; (ii) all MRM types except discriminative ones can benefit from inference-time scaling; and (iii) the number of input video frames affects different MRM types differently. Even the top-performing model, GPT-4o, achieves only 57.0% overall accuracy, while the best open-source model, Qwen2.5-VL-72B, reaches 53.3%, leaving ample room for advancement.
📝 Abstract
Multimodal reward models (MRMs) play a crucial role in the training, inference, and evaluation of Large Vision Language Models (LVLMs) by assessing response quality. However, existing benchmarks for evaluating MRMs in the video domain suffer from a limited number and diversity of questions, a lack of comprehensive evaluation dimensions, and inadequate evaluation of diverse types of MRMs. To address these gaps, we introduce VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. Through our AI-assisted data pipeline, we curate a high-quality preference dataset of 1,563 annotated samples, including 1,482 unique videos and 1,559 distinct questions, 15 times the number of questions in the most question-rich prior benchmark. Each sample is a triplet consisting of a video-text prompt, a chosen response, and a rejected response. We also conduct a comprehensive evaluation of 28 multimodal reward models spanning three categories: generative, discriminative, and semi-scalar. Results show that even the top-performing model, GPT-4o, achieves only 57.0% overall accuracy, and the state-of-the-art open-source model, Qwen2.5-VL-72B, reaches merely 53.3%. Our analysis further reveals three key insights: (i) MRMs trained with reinforcement learning (RL) do not necessarily exhibit stronger cross-modal generalization than those trained without RL; (ii) except for discriminative MRMs, other types of MRMs across varying model capacities can benefit from inference-time scaling; and (iii) variations in input video frame count have different effects on different types of MRMs. We believe VideoRewardBench offers a challenging and valuable benchmark for advancing the evaluation and development of MRMs in the video domain.
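As an illustration of the evaluation setup described above, the following minimal sketch shows how accuracy over preference triplets could be computed. It assumes a reward model exposed as a scoring function that returns one scalar per response; the names `PreferenceSample`, `ScoreFn`, and `pairwise_accuracy` are hypothetical and do not come from the benchmark's code, and the benchmark's exact aggregation of the overall accuracy may differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class PreferenceSample:
    """One benchmark triplet (hypothetical field names)."""
    video_path: str   # path or identifier of the video
    question: str     # text part of the video-text prompt
    chosen: str       # preferred response
    rejected: str     # dispreferred response

# Assumed interface: (video_path, question, response) -> scalar reward.
ScoreFn = Callable[[str, str, str], float]

def pairwise_accuracy(samples: Iterable[PreferenceSample], score_fn: ScoreFn) -> float:
    """Fraction of samples where the chosen response scores strictly higher."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    correct = sum(
        score_fn(s.video_path, s.question, s.chosen)
        > score_fn(s.video_path, s.question, s.rejected)
        for s in samples
    )
    return correct / len(samples)

# Toy scorer for demonstration only: rewards longer responses.
def length_scorer(video_path: str, question: str, response: str) -> float:
    return float(len(response))

toy_samples = [
    PreferenceSample("v1.mp4", "What happens?", "A dog catches a ball.", "A dog."),
    PreferenceSample("v2.mp4", "How many cars?", "Two.", "There are five cars."),
    PreferenceSample("v3.mp4", "Is it safe?", "No, the ladder is unstable.", "Yes."),
]
toy_accuracy = pairwise_accuracy(toy_samples, length_scorer)
```

A tie counts as incorrect in this sketch, which is one possible convention. A generative MRM that outputs a direct preference judgment rather than scalar scores would need a different comparison step.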