🤖 AI Summary
Existing generative reward models rely solely on outcome-based supervision and neglect the quality of the reasoning process, which limits their preference modeling capacity and generalization performance. This work proposes ReflectRM, the first generative reward model to incorporate a self-reflection mechanism into reward modeling. Within a unified framework, ReflectRM jointly optimizes preferences over both final answers and intermediate reasoning steps, and produces its final prediction from the most reliable self-reflection outcome. The proposed approach substantially mitigates positional bias (a +10.2-point improvement) and achieves an average accuracy gain of 3.7 points across four benchmarks with Qwen3-4B, outperforming current state-of-the-art generative reward models.
📝 Abstract
Reward Models (RMs) are critical components of the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing GRM methods focus primarily on outcome-level supervision and neglect the quality of the analytical process, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework that jointly models response preference and analysis preference. During inference, its self-reflection capability is used to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 points on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding a +10.2-point improvement over leading GRMs and establishing itself as a more stable evaluator.
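The inference procedure described above (pick the most reliable self-reflected analysis, then read the preference off it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Analysis` structure, the stubbed candidate list, and the scoring field are all hypothetical, since the abstract does not specify how analyses are sampled or how reflection scores are computed.

```python
from dataclasses import dataclass

@dataclass
class Analysis:
    text: str                # the model's written comparison of responses A and B
    preferred: str           # which response this analysis favors: "A" or "B"
    reflection_score: float  # self-assessed reliability from a reflection pass

def select_most_reliable(analyses: list[Analysis]) -> Analysis:
    """Return the analysis the model judges most reliable via self-reflection."""
    return max(analyses, key=lambda a: a.reflection_score)

def predict_preference(analyses: list[Analysis]) -> str:
    """Derive the final preference from the most reliable analysis."""
    return select_most_reliable(analyses).preferred

# Toy candidates; a real system would sample these from the trained GRM
# and score each one with a self-reflection prompt.
candidates = [
    Analysis("Response A is more factually grounded...", "A", 0.62),
    Analysis("Response B follows the instruction more closely...", "B", 0.91),
    Analysis("Both are similar, slight edge to A...", "A", 0.47),
]
print(predict_preference(candidates))  # prints "B" (highest reflection score)
```

Selecting over several independently generated analyses, rather than trusting a single judgment, is also one plausible reason the method reduces positional bias: an order-sensitive analysis can be outvoted by a more reliable one.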