🤖 AI Summary
Quantifying the reliability of reward models (RMs) remains challenging due to the lack of computable, annotation-free evaluation metrics.
Method: We propose RETA—the first tractable, human-annotation-free reliability metric—defined as the average quality (under an oracle) of the top-η quantile of responses selected by an RM. We develop an end-to-end benchmarking pipeline enabling zero-cost evaluation of arbitrary RMs, incorporating quantile-based response filtering, oracle-based re-labeling, and statistical stability analysis, with adaptive selection of the optimal η.
Contribution/Results: Empirically validated on RLHF and rejection sampling pipelines, RETA demonstrates strong stability and discriminative power across diverse public and proprietary RMs. It significantly improves response selection fidelity and alignment with human preferences—particularly for unreliable RMs—thereby establishing a trustworthy, practical foundation for assessing RM reliability in LLM alignment.
📝 Abstract
The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is no convincing metric to quantify the reliability of RMs. To bridge this gap, we propose the *Reliable at $\eta$* (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top $\eta$ quantile of responses as ranked by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, we can use the RETA metric to identify the optimal quantile from which to select responses.
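The RETA definition above can be sketched in a few lines: rank responses by RM score, keep the top $\eta$ quantile, and average their oracle-assigned quality. This is a minimal illustrative sketch, not the paper's implementation; the function name and arguments are assumptions.

```python
import numpy as np

def reta(rm_scores, oracle_scores, eta):
    """Illustrative RETA sketch: mean oracle quality of the top-eta
    quantile of responses, ranked by the reward model's scores."""
    rm_scores = np.asarray(rm_scores, dtype=float)
    oracle_scores = np.asarray(oracle_scores, dtype=float)
    # Number of responses in the top-eta quantile (at least one).
    k = max(1, int(np.ceil(eta * len(rm_scores))))
    # Indices of the k responses the RM ranks highest.
    top = np.argsort(rm_scores)[::-1][:k]
    # RETA: average oracle quality over that selected subset.
    return float(oracle_scores[top].mean())
```

A perfectly reliable RM ranks exactly the oracle-preferred responses on top, so its RETA at small $\eta$ approaches the oracle's maximum; an unreliable RM mixes poor responses into the top quantile and its RETA drops accordingly.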