Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited consistency and transferability of existing vision-language generative reward models, which it attributes to the lack of explicit optimization of evaluation rubrics. To remedy this, the study introduces, for the first time, rubric quality as an explicit reward signal in reinforcement learning training. A lightweight proxy model (Proxy-SFT/Proxy-RL) is trained to predict human preference rankings using only the rubric as evidence, and its prediction accuracy serves as the reward, explicitly optimizing rubric quality during generation. Remarkably, with only about 50,000 samples, the proposed method achieves state-of-the-art performance on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench, significantly outperforming existing approaches that use four times more data. Moreover, the learned rubrics show strong zero-shot transferability to unseen evaluators, improving reward accuracy at test time.
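To make the training signal concrete, here is a minimal Python sketch of the rubric-quality reward as the summary describes it: a frozen proxy verifier must recover the human preference ordering using only the candidate rubric, and its accuracy becomes the scalar RL reward for the rubric generator. The `PreferencePair` layout, the `proxy_predict` interface, and the keyword-matching `toy_proxy` are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the proxy-guided rubric reward described above.
# `PreferencePair`, `proxy_predict`, and `toy_proxy` are illustrative
# assumptions, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    query: str
    response_a: str
    response_b: str
    human_label: str  # "A" or "B": the response human annotators preferred

def rubric_reward(
    rubric: str,
    batch: List[PreferencePair],
    proxy_predict: Callable[[str, PreferencePair], str],
) -> float:
    """Score a candidate rubric by how accurately a frozen proxy verifier
    recovers the human preference ordering using only the rubric as
    evidence. The mean accuracy is the scalar RL reward for the generator."""
    correct = sum(
        proxy_predict(rubric, pair) == pair.human_label for pair in batch
    )
    return correct / len(batch)

# Toy stand-in verifier: prefers the response matching more rubric criteria.
def toy_proxy(rubric: str, pair: PreferencePair) -> str:
    criteria = [c.strip().lower() for c in rubric.split(";") if c.strip()]
    hits_a = sum(c in pair.response_a.lower() for c in criteria)
    hits_b = sum(c in pair.response_b.lower() for c in criteria)
    return "A" if hits_a >= hits_b else "B"

if __name__ == "__main__":
    batch = [
        PreferencePair(
            query="Describe the image.",
            response_a="A red car parked on a quiet road.",
            response_b="A car.",
            human_label="A",
        )
    ]
    print(rubric_reward("red; road", batch, toy_proxy))  # -> 1.0
```

In an actual RL loop (e.g. PPO- or GRPO-style policy optimization) this scalar would be attached to each sampled rubric as its return; the proxy itself stays frozen so the generator cannot game it by co-adaptation.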

📝 Abstract
Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into reinforcement learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench, outperforming methods trained on four times as much data. Ablations show that Proxy-SFT is a stronger verifier than Proxy-RL, and that implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.
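The abstract's zero-shot transfer claim amounts to conditioning a different, frozen evaluator on a rubric produced by the trained generator. Below is a minimal sketch of that test-time usage; the `JUDGE_TEMPLATE` prompt wording and the `unseen_evaluator` text-in/text-out interface are invented for illustration and do not come from the paper.

```python
# Minimal sketch of test-time rubric transfer: a rubric from the trained
# generator is prepended to an unseen evaluator's judging prompt.
# `JUDGE_TEMPLATE` and the `unseen_evaluator` interface are assumptions.
from typing import Callable

JUDGE_TEMPLATE = """You are a strict evaluator. Judge the two responses
against each criterion of the rubric below, then name the better one.

Rubric:
{rubric}

Query: {query}
Response A: {response_a}
Response B: {response_b}

Answer with a single letter, A or B."""

def transfer_rubric_verdict(
    rubric: str,
    query: str,
    response_a: str,
    response_b: str,
    unseen_evaluator: Callable[[str], str],
) -> str:
    """Ask an evaluator that was never trained with Proxy-GRM to judge a
    preference pair, conditioned zero-shot on the transferred rubric."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric,
        query=query,
        response_a=response_a,
        response_b=response_b,
    )
    # `unseen_evaluator` is any text-in/text-out judge, e.g. an API call.
    return unseen_evaluator(prompt).strip()[:1].upper()
```

Because only the prompt changes, this requires no gradient updates to the evaluator, which is what makes the transferability result a test-time improvement.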
Problem

Research questions and friction points this paper is trying to address.

generative reward models
vision-language models
rubric optimization
reward modeling
transferable evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proxy-Guided Critique
Transferable Rubrics
Generative Reward Models
Vision-Language Models
Reinforcement Learning
👥 Authors
Weijie Qiu
Qwen Large Model Application Team, Alibaba

Dai Guan
Qwen Large Model Application Team, Alibaba

Junxin Wang
Institute of Automation, Chinese Academy of Sciences

Zhihang Li
Kwai Inc
Computer Vision · Generative model · video/image generation · LLM

Yongbo Gai
Qwen Large Model Application Team, Alibaba

Mengyu Zhou
Microsoft Research
Data analytics · Natural Language Processing · Network Science · Human Behaviors · Mobile & Ubiquitous Computing

Erchao Zhao
Qwen Large Model Application Team, Alibaba

Xiaoxi Jiang
Qwen Large Model Application Team, Alibaba

Guanjun Jiang
Qwen Large Model Application Team, Alibaba