🤖 AI Summary
Existing reward models (RMs) compress multidimensional quality assessments into a single scalar score, leading to judgment diffusion: attention spread thinly across evaluation criteria and superficial analysis. To address this, the authors propose the Branch-and-Rethink Reward Model (BR-RM), an RM built around a "think-twice" mechanism: a first turn performs adaptive branching, selecting a small set of instance-critical quality dimensions and sketching concise, evidence-seeking hypotheses; a second turn performs branch-conditioned rethinking, a targeted reread that tests those hypotheses and sharpens sensitivity to subtle errors. BR-RM is trained end-to-end via GRPO-style reinforcement learning with a binary outcome reward, keeping it fully compatible with standard RLHF pipelines. Evaluated on three challenging, cross-domain benchmarks, BR-RM achieves state-of-the-art performance, outperforming prior methods while remaining practical and scalable.
📝 Abstract
Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.
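The two-turn flow described above can be sketched in a few lines. This is a minimal, hypothetical illustration under stated assumptions, not the paper's implementation: the `branch` and `rethink` functions stand in for two LLM turns (stubbed here with toy logic), the output format is invented, and the binary outcome reward simply checks the final verdict against a preference label after a strict format check.

```python
# Hypothetical sketch of BR-RM's two-turn evaluation with a binary
# outcome reward. All names, formats, and logic are illustrative
# assumptions standing in for actual model calls.

def branch(instruction: str, response: str) -> dict:
    """Turn 1 (adaptive branching): select instance-critical quality
    dimensions and sketch evidence-seeking hypotheses. Stubbed output."""
    return {
        "dimensions": ["factuality", "safety"],
        "hypotheses": ["cites an unverified statistic", "no unsafe content"],
    }

def rethink(instruction: str, response: str, branch_out: dict) -> str:
    """Turn 2 (branch-conditioned rethinking): targeted reread that tests
    the Turn-1 hypotheses. Stubbed: reject if any hypothesis flags an
    unverified claim."""
    flagged = any("unverified" in h for h in branch_out["hypotheses"])
    return "reject" if flagged else "accept"

def outcome_reward(verdict: str, preferred: bool) -> float:
    """Binary outcome reward: 1.0 iff the verdict matches the preference
    label and passes the (here, trivial) format check."""
    if verdict not in {"accept", "reject"}:  # strict format check
        return 0.0
    return 1.0 if (verdict == "accept") == preferred else 0.0

# Example rollout: a dispreferred response is correctly rejected.
prompt, resp = "Summarize the study.", "The study shows a 73% gain."
b = branch(prompt, resp)
verdict = rethink(prompt, resp, b)
print(outcome_reward(verdict, preferred=False))  # → 1.0
```

In an actual GRPO-style setup, this scalar reward would score sampled two-turn traces, with malformed traces zeroed out by the format check before any preference comparison.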