Ask a Strong LLM Judge when Your Reward Model is Uncertain

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional reward models (RMs) in RLHF suffer from vulnerability to reward hacking and poor out-of-distribution (OOD) generalization; while strong LLM-based judges generalize well, they incur prohibitive inference costs. To address this trade-off, we propose an uncertainty-aware dynamic routing framework: leveraging entropy or prediction inconsistency to quantify RM confidence, and automatically routing high-confidence samples to a lightweight RM while delegating low-confidence samples to a stronger LLM judge. This is the first work to systematically integrate uncertainty estimation into sample routing for reward modeling. Implemented within a pairwise preference classification framework, our method enables efficient advantage estimation and remains compatible with policy gradient optimization. Experiments demonstrate that, under fixed computational budgets, our approach significantly outperforms random invocation baselines—improving both robustness and performance across multiple RM benchmarks and online RLHF downstream tasks.
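The entropy-based routing idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the Bradley-Terry link (sigmoid of the reward gap) and the entropy threshold of 0.9 bits are assumptions chosen for the example.

```python
import math

def pairwise_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B,
    given scalar RM rewards (assumed link function, for illustration)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of the Bernoulli preference prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def route(reward_a: float, reward_b: float, threshold: float = 0.9) -> str:
    """Send high-entropy (uncertain) pairs to the LLM judge,
    confident pairs to the lightweight RM."""
    p = pairwise_preference_prob(reward_a, reward_b)
    return "llm_judge" if binary_entropy(p) > threshold else "reward_model"
```

A pair with a large reward gap yields a near-deterministic preference (low entropy) and stays with the RM; a near-tied pair yields entropy close to 1 bit and is escalated to the judge.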

📝 Abstract
Reward models (RMs) play a pivotal role in reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
Problem

Research questions and friction points this paper is trying to address.

Classical reward models trained on human preferences are vulnerable to reward hacking
Reward models generalize poorly to out-of-distribution inputs
Strong LLM judges generalize well but their high inference costs limit use in online RLHF
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty-based routing combines a fast reward model with a strong LLM judge
LLM judge is invoked only on uncertain preference pairs; confident pairs stay with the RM
Formulates advantage estimation in policy gradient methods as pairwise preference classification
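The fixed-budget comparison against random judge calling can be illustrated with a small sketch: given per-pair uncertainty scores, spend the judge budget on the most uncertain pairs rather than a random subset. The budget fraction and the `allocate_judge_budget` helper are illustrative assumptions, not the paper's implementation.

```python
def allocate_judge_budget(uncertainties: list[float],
                          budget_frac: float = 0.2) -> set[int]:
    """Return indices of the pairs to escalate to the LLM judge:
    the top budget_frac fraction by uncertainty (assumed policy)."""
    k = max(1, int(len(uncertainties) * budget_frac))
    order = sorted(range(len(uncertainties)),
                   key=lambda i: uncertainties[i],
                   reverse=True)
    return set(order[:k])
```

Under the same budget, a random baseline would sample `k` indices uniformly; the uncertainty-aware policy instead concentrates the expensive judge calls where the RM's prediction is least reliable.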