BaseReward: A Strong Baseline for Multimodal Reward Model

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal reward modeling (MRM) lacks systematic guidelines, hindering the alignment of multimodal large language models (MLLMs) with human preferences. Method: We propose BaseReward—a modular MRM framework built upon Qwen2.5-VL, featuring a lightweight two-layer reward head. It jointly trains on high-quality multimodal and pure-text preference data and incorporates ensemble learning to enhance robustness. Contribution/Results: This work establishes the first unified, reproducible MRM construction paradigm, enabling systematic ablation of modeling paradigms, architectural choices, and data strategies. BaseReward achieves state-of-the-art performance across three major benchmarks—MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench. Furthermore, it successfully drives RL-based optimization, yielding significant improvements in perception, reasoning, and dialogue capabilities of MLLMs.
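The summary describes a lightweight two-layer reward head trained on preference pairs. A minimal sketch of that setup, using a standard Bradley-Terry pairwise loss over pooled backbone features (the hidden size, activation, and initialization below are my assumptions, not details confirmed by the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TwoLayerRewardHead:
    """Maps a pooled backbone feature vector to a scalar reward.

    Hypothetical sketch: hidden size, ReLU activation, and Gaussian
    initialization are assumptions, not the authors' exact choices.
    """
    def __init__(self, hidden_size: int):
        self.w1 = rng.normal(0, 0.02, (hidden_size, hidden_size))
        self.b1 = np.zeros(hidden_size)
        self.w2 = rng.normal(0, 0.02, (hidden_size, 1))
        self.b2 = np.zeros(1)

    def __call__(self, h: np.ndarray) -> np.ndarray:
        # h: (batch, hidden_size) pooled features -> (batch,) rewards
        z = np.maximum(h @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return (z @ self.w2 + self.b2).squeeze(-1)

def bradley_terry_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # Standard pairwise preference loss: -log sigmoid(r_chosen - r_rejected)
    return float(-np.log(sigmoid(r_chosen - r_rejected)).mean())

# Toy usage: random vectors stand in for the MLLM backbone's outputs
head = TwoLayerRewardHead(hidden_size=8)
h_chosen, h_rejected = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = bradley_terry_loss(head(h_chosen), head(h_rejected))
```

In a real pipeline the feature vectors would come from the Qwen2.5-VL backbone's final hidden states, and the loss would be minimized over curated multimodal and text-only preference pairs.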

📝 Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a Qwen2.5-VL backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically-backed guide for developing robust reward models for the next generation of MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Lack of a systematic guide for building multimodal reward models
Investigating key components in the multimodal reward model development pipeline
Providing a clear recipe for constructing high-performance reward models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Qwen2.5-VL backbone with an optimized two-layer reward head
Trained on a curated mixture of multimodal and text-only preference data
Establishes new SOTA across major benchmark evaluations
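The summary and abstract also credit ensemble methods with improving robustness. One simple form of score-level ensembling is to normalize each model's scores before averaging, so models with different reward scales contribute equally (the z-normalization rule here is my assumption; the paper's exact ensembling scheme may differ):

```python
import numpy as np

def ensemble_rewards(scores_per_model):
    """Combine per-candidate scores from several reward models.

    scores_per_model: list of 1-D arrays, one per model, aligned over
    the same candidate responses. Each model's scores are z-normalized
    before averaging (an assumed detail, not the paper's stated rule).
    """
    combined = []
    for s in scores_per_model:
        s = np.asarray(s, dtype=float)
        std = s.std()
        combined.append((s - s.mean()) / (std if std > 0 else 1.0))
    return np.mean(combined, axis=0)

# Three hypothetical reward models score four candidate responses
# on very different scales; normalization puts them on equal footing.
scores = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 30.0, 20.0, 40.0],
    [0.1, 0.4, 0.2, 0.3],
]
final = ensemble_rewards(scores)
best = int(np.argmax(final))  # index of the ensemble-preferred response
```

Without the normalization step, the model with the largest raw score range would dominate the average; z-scoring is a common way to avoid that.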