Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

📅 2025-12-18

🤖 AI Summary

Existing reward models (RMs) lack evaluation protocols for interleaved multimodal sequences (mixtures of text and images), a critical gap in multimodal alignment assessment. Method: We introduce MMRB2, the first comprehensive benchmark for interleaved multimodal RM evaluation, covering four task categories: text-to-image generation, image editing, interleaved generation, and multimodal reasoning. It comprises 4,000 expert-annotated preference pairs (1,000 per task), curated with an ensemble filtering strategy to retain pairs with strong human-expert consensus. Contribution/Results: Experiments show a strong correlation between MMRB2 scores and Best-of-N sampling performance, which supports the benchmark's downstream predictive validity. On MMRB2, Gemini 3 Pro reaches 75–80% accuracy and Qwen3-VL-32B reaches 64%, both above GPT-4o (59%) but well below human accuracy (>90%). This work provides a standardized evaluation framework for interleaved multimodal reward modeling, intended to support principled RM design, training, and alignment.
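
As a rough illustration of the accuracy figures quoted above, the sketch below computes per-task agreement between a judge and human preference labels. The record fields (`task`, `human_choice`) and the `judge` callable are hypothetical names chosen for this example; the summary does not specify MMRB2's data format or scoring protocol.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def pairwise_accuracy(
    pairs: Iterable[Mapping[str, str]],
    judge: Callable[[Mapping[str, str]], str],
) -> Dict[str, float]:
    """Per task, the fraction of pairs where the judge picks the human-preferred response."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair in pairs:
        task = pair["task"]
        total[task] += 1
        # The judge returns "A" or "B"; count it when it matches the human label.
        if judge(pair) == pair["human_choice"]:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records for illustration only; not taken from the benchmark.
toy_pairs = [
    {"task": "text-to-image", "human_choice": "A"},
    {"task": "text-to-image", "human_choice": "B"},
    {"task": "image editing", "human_choice": "A"},
]


def always_a(pair: Mapping[str, str]) -> str:
    """Hypothetical baseline judge that always prefers response A."""
    return "A"


accuracy_by_task = pairwise_accuracy(toy_pairs, always_a)
```

A trivial judge like `always_a` gives a useful floor: on a balanced set of pairs it should land near 50%, which puts the reported 59% to 80% range in context.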

📝 Abstract

Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best-performing open-source model, Qwen3-VL-32B, achieves accuracy similar to that of Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling, and we conduct an in-depth analysis that identifies key areas for improving reward models going forward.
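
Best-of-N sampling, which the abstract uses as the downstream check, can be sketched in a few lines: draw N candidate responses, score each with the reward model, and keep the highest-scoring one. This is a minimal sketch of the general technique, not the paper's experimental setup; `best_of_n`, `toy_candidates`, and `toy_reward` are hypothetical names, and the toy reward (string length) merely stands in for a real reward model.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def best_of_n(candidates: List[T], reward: Callable[[T], float]) -> T:
    """Return the candidate with the highest reward (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward)


# Toy stand-ins: in practice the candidates would be N samples from a
# generator and the reward would come from the reward model under test.
toy_candidates = ["short", "a longer answer", "mid one"]
toy_reward = len  # assumption for illustration: longer string = higher reward

best = best_of_n(toy_candidates, toy_reward)
```

Under this scheme a reward model that ranks pairs more accurately should pick better candidates more often, which is the intuition behind correlating benchmark accuracy with Best-of-N results.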

Problem

Research questions and friction points this paper is trying to address.

Evaluates reward models for multimodal understanding and generation

Benchmarks performance on interleaved text-image tasks

Identifies gaps between AI and human judgment accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Multimodal RewardBench 2 benchmark for reward models

Evaluates reward models on multimodal understanding and generation tasks

Uses expert-annotated preference pairs from diverse models and agents
