DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of existing single-step multimodal reward models to rely on language priors, which undermines their ability to verify fine-grained visual content. The authors propose a two-stage preference evaluation framework: in the planning stage, the model generates a neutral, instance-specific verification checklist; in the execution stage, it runs those checks against the image and the associated question to reach a grounded judgment. The approach brings dynamically generated rubrics to multimodal reward modeling and aims to improve reliability and generalization by decoupling planning from verification. Within a single multimodal large language model, multi-role reinforcement learning jointly optimizes a disagreement planner and a checklist verifier. On VL-RewardBench, the method improves overall accuracy by 22.6 and 18.8 percentage points for the Qwen3-VL 4B and 8B Instruct models, respectively, substantially outperforming no-rubric baselines.

📝 Abstract

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
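
The plan-and-execute flow described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `MLLM` callable interface, the prompt wording, the `VERDICT:` output convention, and the names `plan_checklist`, `verify`, `judge`, and `stub_model`. It only shows how one model can be called twice, first as a planner and then as a verifier.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping (prompt, image bytes) to text.
MLLM = Callable[[str, bytes], str]


@dataclass
class Judgment:
    checklist: List[str]  # instance-specific checks produced by the planner
    verdict: str          # "A" or "B"


def plan_checklist(model: MLLM, image: bytes, question: str,
                   resp_a: str, resp_b: str) -> List[str]:
    """Step 1 (planner role): ask for neutral checks on where A and B differ."""
    prompt = (
        "PLAN: List neutral checks that would settle where the two responses "
        "disagree. Do not say which response is better.\n"
        f"Question: {question}\nResponse A: {resp_a}\nResponse B: {resp_b}"
    )
    raw = model(prompt, image)
    items = []
    for line in raw.splitlines():
        # Strip a leading bullet ("-", "*") or number ("1.", "2)") if present.
        text = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if text:
            items.append(text)
    return items


def verify(model: MLLM, image: bytes, question: str,
           resp_a: str, resp_b: str, checklist: List[str]) -> str:
    """Step 2 (verifier role): run each check against the image, then decide."""
    checks = "\n".join(f"{i}. {c}" for i, c in enumerate(checklist, 1))
    prompt = (
        "VERIFY: Answer each check using the image, then end with a line "
        "'VERDICT: A' or 'VERDICT: B'.\n"
        f"Question: {question}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
        f"Checks:\n{checks}"
    )
    raw = model(prompt, image)
    match = re.search(r"VERDICT:\s*([AB])\b", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("verifier output did not contain a verdict")
    return match.group(1).upper()


def judge(model: MLLM, image: bytes, question: str,
          resp_a: str, resp_b: str) -> Judgment:
    """Plan first, then verify, using the same model in both roles."""
    checklist = plan_checklist(model, image, question, resp_a, resp_b)
    verdict = verify(model, image, question, resp_a, resp_b, checklist)
    return Judgment(checklist=checklist, verdict=verdict)


def stub_model(prompt: str, image: bytes) -> str:
    """Stand-in for a real MLLM so the sketch runs without any model."""
    if prompt.startswith("PLAN"):
        return "1. Is the cup left of the laptop?\n- How many chairs are visible?"
    return "Check 1 supports A. Check 2 supports A.\nVERDICT: A"
```

Calling `judge(stub_model, b"", "Describe the desk.", "The cup is on the left.", "The cup is on the right.")` returns a `Judgment` holding the two parsed checks and the verdict. In a real setting the stub would be replaced by a call to an actual multimodal model, and the reinforcement learning stage the abstract mentions would sit outside this inference loop.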

Problem

Research questions and friction points this paper is trying to address.

Multimodal Reward Modeling

Lazy Judging

Visual Reasoning

Rubric-based Evaluation

Preference Evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

DeltaRubric

Multimodal Reward Modeling

Plan-and-Execute