🤖 AI Summary
Existing reward models (RMs) compress multidimensional quality assessments into a single scalar score, leading to judgment diffusion: attention spread thinly across evaluation criteria and superficial analysis. To address this, the authors propose the Branch-and-Rethink Reward Model (BR-RM), an RM built around a "think-twice" mechanism: a first turn performs adaptive branching, selecting a small set of instance-critical quality dimensions and sketching concise, evidence-seeking hypotheses; a second turn performs branch-conditioned rethinking, a targeted reread that tests those hypotheses and sharpens sensitivity to subtle errors. BR-RM is trained end-to-end via GRPO-style reinforcement learning with a binary outcome reward, keeping it fully compatible with standard RLHF pipelines. Evaluated on three challenging, cross-domain benchmarks, BR-RM achieves state-of-the-art performance, outperforming prior methods while remaining practical and scalable.
📝 Abstract
Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.
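The two-turn flow described above can be sketched in a few lines. This is a minimal, hypothetical illustration under stated assumptions, not the paper's implementation: the `branch` and `rethink` functions stand in for two LLM turns (stubbed here with toy logic), the output format is invented, and the binary outcome reward simply checks the final verdict against a preference label after a strict format check.

```python
# Hypothetical sketch of BR-RM's two-turn evaluation with a binary
# outcome reward. All names, formats, and logic are illustrative
# assumptions standing in for actual model calls.

def branch(instruction: str, response: str) -> dict:
    """Turn 1 (adaptive branching): select instance-critical quality
    dimensions and sketch evidence-seeking hypotheses. Stubbed output."""
    return {
        "dimensions": ["factuality", "safety"],
        "hypotheses": ["cites an unverified statistic", "no unsafe content"],
    }

def rethink(instruction: str, response: str, branch_out: dict) -> str:
    """Turn 2 (branch-conditioned rethinking): targeted reread that tests
    the Turn-1 hypotheses. Stubbed: reject if any hypothesis flags an
    unverified claim."""
    flagged = any("unverified" in h for h in branch_out["hypotheses"])
    return "reject" if flagged else "accept"

def outcome_reward(verdict: str, preferred: bool) -> float:
    """Binary outcome reward: 1.0 iff the verdict matches the preference
    label and passes the (here, trivial) format check."""
    if verdict not in {"accept", "reject"}:  # strict format check
        return 0.0
    return 1.0 if (verdict == "accept") == preferred else 0.0

# Example rollout: a dispreferred response is correctly rejected.
prompt, resp = "Summarize the study.", "The study shows a 73% gain."
b = branch(prompt, resp)
verdict = rethink(prompt, resp, b)
print(outcome_reward(verdict, preferred=False))  # → 1.0
```

In an actual GRPO-style setup, this scalar reward would score sampled two-turn traces, with malformed traces zeroed out by the format check before any preference comparison.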