Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
In RLHF, conventional Bradley-Terry reward models (BT RMs) are sensitive to data size and coverage and vulnerable to reward hacking, while existing generative reward models (GenRMs) rely on shallow, vertically scaled reasoning and produce pairwise outputs incompatible with standard pointwise RLHF algorithms. Method: We propose Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process: rather than following structured, externally provided rationales, the model generates flexible, self-guided reasoning traces that support self-reflection, hypothetical reasoning, and divergent reasoning. Training proceeds by chain-of-thought supervised fine-tuning followed by rule-based reinforcement learning, and a novel pairwise RLHF pipeline optimizes policies directly from pairwise preference rewards, bypassing pointwise conversion. Contribution/Results: Think-RM achieves an 8% improvement over both BT RMs and vertically scaled GenRMs on RM-Bench and yields significant end-policy performance gains, demonstrating robustness to data bias and improved reasoning fidelity.
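To make the GenRM output format concrete, below is a minimal Python sketch of extracting a pairwise verdict from a long, self-guided reasoning trace. The `<think>` delimiter and the `[[A]]`/`[[B]]` verdict markers are illustrative assumptions, not the paper's confirmed output format.

```python
import re

# Assumed output shape: a free-form reasoning trace inside <think>...</think>,
# followed by a final pairwise verdict such as "[[A]]" or "[[B]]".
VERDICT = re.compile(r"\[\[([AB])\]\]")

def parse_pairwise_verdict(generation: str) -> str | None:
    """Return 'A' or 'B' from the text after the reasoning trace, else None."""
    answer = generation.split("</think>")[-1]  # keep only the post-trace answer
    match = VERDICT.search(answer)
    return match.group(1) if match else None

# Example: a trace that reflects, then prefers response B.
print(parse_pairwise_verdict("<think>B is more faithful...</think> Verdict: [[B]]"))  # -> B
```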

📝 Abstract
Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. Rather than producing structured, externally provided rationales, Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities such as self-reflection, hypothetical reasoning, and divergent reasoning. To elicit these reasoning abilities, we first warm up the models via supervised fine-tuning (SFT) over long CoT data. We then further improve the model's long-horizon reasoning abilities by rule-based reinforcement learning (RL). In addition, we propose a novel pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards, eliminating the need for pointwise reward conversion and enabling more effective use of Think-RM outputs. Experiments show that Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches.
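The rule-based RL stage described in the abstract suggests a verifiable, binary training signal: the reward model earns reward only when its final verdict agrees with the human-annotated preference. A minimal sketch, assuming the verdict format above; the reward scale and parsing rules are illustrative, not taken from the paper.

```python
import re

def rule_based_reward(generation: str, gold_preference: str) -> float:
    """1.0 iff the model's parsed verdict ('A' or 'B') matches the
    human-annotated preference label; 0.0 otherwise (including no verdict)."""
    answer = generation.split("</think>")[-1]     # discard the reasoning trace
    match = re.search(r"\[\[([AB])\]\]", answer)  # assumed verdict marker
    return 1.0 if match and match.group(1) == gold_preference else 0.0
```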
Problem

Research questions and friction points this paper is trying to address.

Enhancing long-horizon reasoning in generative reward models
Overcoming limitations of conventional reward models in RLHF
Enabling advanced reasoning capabilities like self-reflection and hypothetical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative reward models with self-guided reasoning traces
Supervised fine-tuning over long chain-of-thought data
Pairwise RLHF pipeline optimizing policies directly
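The schematic below shows one plausible reading of the pairwise pipeline: sample two responses per prompt, let the GenRM judge them head-to-head, and reward the winner and loser directly, with no conversion of the preference into pointwise scores. `sample` and `prefers_first` are hypothetical stand-ins for a policy rollout and a Think-RM-style judge, and the +1/-1 scaling is an assumption rather than the paper's exact formulation.

```python
from typing import Callable, List, Tuple

def pairwise_rewards(
    prompt: str,
    sample: Callable[[str], str],                    # hypothetical policy rollout
    prefers_first: Callable[[str, str, str], bool],  # hypothetical GenRM judge
) -> List[Tuple[str, float]]:
    """Reward the judged winner of a sampled response pair directly,
    skipping any pointwise reward conversion."""
    resp_a, resp_b = sample(prompt), sample(prompt)
    a_wins = prefers_first(prompt, resp_a, resp_b)
    # The +1 / -1 preference rewards can feed a standard policy-gradient update.
    return [(resp_a, 1.0 if a_wins else -1.0),
            (resp_b, -1.0 if a_wins else 1.0)]
```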
Authors

Ilgee Hong, Georgia Institute of Technology (Machine Learning, Large Language Models)
Changlong Yu, Amazon
Liang Qiu, Amazon
Weixiang Yan, Amazon (Code Intelligence, Agentic RL, Software Automation)
Zhenghao Xu, Georgia Institute of Technology
Haoming Jiang, OpenAI; Ex-Amazon; Georgia Institute of Technology (Machine Learning)
Qingru Zhang, Georgia Institute of Technology (Large Language Models, LLM Efficiency, Machine Learning)
Qin Lu, Amazon
Xin Liu, Amazon
Chao Zhang, Georgia Institute of Technology
Tuo Zhao, Georgia Institute of Technology