DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

📅 2026-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of existing single-step multimodal reward models to rely on language priors, which undermines their ability to verify fine-grained visual content. To overcome this limitation, the authors propose a two-stage preference evaluation framework: in the planning stage, the model generates a neutral, instance-specific verification checklist; in the execution stage, it runs those checks against the image and the associated question to reach a grounded judgment. The approach introduces dynamically generated rubrics into multimodal reward modeling for the first time, and decoupling planning from verification improves evaluation reliability and generalization. Within a single multimodal large language model, multi-role reinforcement learning jointly optimizes the Disagreement Planner and the Checklist Verifier. On VL-RewardBench, the method improves overall accuracy by 22.6 and 18.8 points for the Qwen3-VL 4B and 8B Instruct models, respectively, substantially outperforming no-rubric baselines.
📝 Abstract
Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details, so robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves the base models' overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
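The two-step flow described in the abstract can be sketched as follows. This is a minimal illustration of the plan-and-execute control flow only, not the authors' implementation: the `Model` callable, the prompt wording, the `Preferred:` output convention, and every function name are assumptions made for the example, and the reinforcement learning stage is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface (assumption): any callable mapping (prompt, image bytes) -> text.
Model = Callable[[str, bytes], str]


@dataclass
class Judgment:
    checklist: List[str]  # checks produced in the planning step
    findings: str         # the verifier's per-check answers
    preferred: str        # "A" or "B"


def plan_checklist(model: Model, image: bytes, question: str,
                   resp_a: str, resp_b: str) -> List[str]:
    """Step 1: the model acts as planner and lists neutral checks."""
    prompt = (
        "ROLE: Disagreement Planner\n"
        f"Question: {question}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
        "List neutral checks, one per line, that would settle where the "
        "responses disagree. Do not say which response is better."
    )
    raw = model(prompt, image)
    items = [ln.lstrip("-* ").strip() for ln in raw.splitlines()]
    return [item for item in items if item]


def verify(model: Model, image: bytes, question: str, resp_a: str,
           resp_b: str, checklist: List[str]) -> Judgment:
    """Step 2: the same model runs its own checks against the image."""
    checks = "\n".join(f"{i}. {c}" for i, c in enumerate(checklist, 1))
    prompt = (
        "ROLE: Checklist Verifier\n"
        f"Question: {question}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
        f"Checks:\n{checks}\n"
        "Answer each check against the image, then end with a line "
        "'Preferred: A' or 'Preferred: B'."
    )
    raw = model(prompt, image)
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    verdicts = [ln for ln in lines if ln.startswith("Preferred:")]
    if not verdicts:
        raise ValueError("verifier output has no 'Preferred:' line")
    preferred = verdicts[-1].split(":", 1)[1].strip()
    if preferred not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {preferred!r}")
    findings = "\n".join(ln for ln in lines if not ln.startswith("Preferred:"))
    return Judgment(checklist, findings, preferred)


def judge(model: Model, image: bytes, question: str,
          resp_a: str, resp_b: str) -> Judgment:
    """Plan first, then verify, using one model for both roles."""
    checklist = plan_checklist(model, image, question, resp_a, resp_b)
    return verify(model, image, question, resp_a, resp_b, checklist)


def stub_model(prompt: str, image: bytes) -> str:
    """Canned stand-in for an MLLM so the sketch runs without a real model."""
    if prompt.startswith("ROLE: Disagreement Planner"):
        return "- Is the mug left of the laptop?\n- Is the mug red?"
    return "1. Yes, the mug is on the left.\n2. No, the mug is blue.\nPreferred: B"


result = judge(stub_model, b"", "Describe the desk.",
               "A red mug sits left of the laptop.",
               "A blue mug sits left of the laptop.")
print(result.checklist, result.preferred)
```

Keeping the planner prompt free of any request for a verdict mirrors the abstract's requirement that the checklist be neutral; the preference is only produced after the checks have been run against the image.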
Problem

Research questions and friction points this paper is trying to address.

Multimodal Reward Modeling
Lazy Judging
Visual Reasoning
Rubric-based Evaluation
Preference Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeltaRubric
Multimodal Reward Modeling
Plan-and-Execute
Verification Checklist
Reinforcement Learning