🤖 AI Summary
This work addresses the inefficiency of conventional discriminative reward models, which need a separate forward pass for each candidate response and therefore cannot compare responses jointly. The authors propose a multi-response reward model that concatenates the candidate responses, separated by delimiter tokens, and scores all of them in a single forward pass, enabling N-way preference learning and direct comparative inference. Built on a 4B-parameter vision-language backbone, the model uses LoRA fine-tuning and a lightweight MLP value head, and is trained end-to-end with a cross-entropy loss. It achieves state-of-the-art performance on six multimodal reward benchmarks, outperforming larger generative and discriminative reward models. When used as the reward in GRPO, it improves both the generation quality of the policy model and training stability. The study also introduces two new benchmarks: MR²Bench-Image and MR²Bench-Video.
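The cost of separate forward passes can be illustrated with a rough token count. The sketch below is not from the paper: it assumes compute grows linearly with the number of tokens processed (ignoring the quadratic attention term), that each response adds one delimiter token, and the function names and example sizes are hypothetical.

```python
def tokens_single_response(prompt_tokens: int, response_tokens: int, n: int) -> int:
    """Tokens processed when each of n responses gets its own forward pass.

    The shared prompt (image or video tokens plus the question) is re-encoded n times.
    """
    return n * (prompt_tokens + response_tokens)


def tokens_multi_response(prompt_tokens: int, response_tokens: int, n: int) -> int:
    """Tokens processed when all n responses share one forward pass.

    Assumption: one delimiter token follows each response.
    """
    return prompt_tokens + n * (response_tokens + 1)


def token_ratio(prompt_tokens: int, response_tokens: int, n: int) -> float:
    """Ratio of tokens processed: single-response scoring over multi-response scoring."""
    single = tokens_single_response(prompt_tokens, response_tokens, n)
    multi = tokens_multi_response(prompt_tokens, response_tokens, n)
    return single / multi


# Hypothetical sizes: a long multimodal prompt and four short responses.
ratio = token_ratio(prompt_tokens=4000, response_tokens=100, n=4)
```

Under these assumptions the ratio stays below N and approaches N as the shared prompt grows relative to the responses, which is the regime where long image or video prompts dominate the input.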
📝 Abstract
We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring one forward pass per candidate. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image, which contains human-annotated rankings over responses from 8 diverse models; and (2) MR$^2$Bench-Video, a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering responses from 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks (MR$^2$Bench-Image, MR$^2$Bench-Video, and four existing benchmarks), outperforming larger generative and discriminative reward models. We further demonstrate that, when used as the reward in reinforcement learning with GRPO, our model produces policy models that maintain performance on standard multimodal benchmarks while substantially improving open-ended generation quality. It outperforms a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
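The cross-entropy over scalar scores described above can be sketched as a softmax over the $N$ scores with the preferred response as the target class. This is a minimal illustration, not the authors' implementation: the abstract does not specify how the target is built from a ranking, so a single preferred index is assumed, and the names `n_way_cross_entropy` and `pick_best` are hypothetical.

```python
import math
from typing import Sequence


def n_way_cross_entropy(scores: Sequence[float], best_index: int) -> float:
    """Softmax cross-entropy over N scalar scores.

    scores:     one scalar per candidate response (e.g. value-head outputs).
    best_index: position of the preferred response (assumed single target).
    """
    m = max(scores)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[best_index]


def pick_best(scores: Sequence[float]) -> int:
    """Inference: return the index of the highest-scoring response."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical scores for four candidate responses from one forward pass.
scores = [2.0, 0.5, -1.0, 0.0]
loss = n_way_cross_entropy(scores, best_index=0)
best = pick_best(scores)
```

With $N = 2$ this loss reduces to the pairwise logistic loss $\log(1 + e^{-(s_0 - s_1)})$, so the $N$-way form can be read as a generalization of pairwise preference training to several candidates at once.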