Establishing Reliability Metrics for Reward Models in Large Language Models

📅 2025-04-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Quantifying the reliability of reward models (RMs) remains challenging due to the lack of computable, annotation-free evaluation metrics. Method: We propose RETA—the first tractable, human-annotation-free reliability metric—defined as the average quality (under an oracle) of the top-η quantile of responses selected by an RM. We develop an end-to-end benchmarking pipeline enabling zero-cost evaluation of arbitrary RMs, incorporating quantile-based response filtering, oracle-based re-labeling, and statistical stability analysis, with adaptive selection of the optimal η. Contribution/Results: Empirically validated on RLHF and rejection sampling pipelines, RETA demonstrates strong stability and discriminative power across diverse public and proprietary RMs. It significantly improves response selection fidelity and alignment with human preferences—particularly for unreliable RMs—thereby establishing a trustworthy, practical foundation for assessing RM reliability in LLM alignment.

📝 Abstract
The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is a lack of a convincing metric to quantify the reliability of RMs. To bridge this gap, we propose the Reliable at η (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top η quantile of responses assessed by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, we can use the RETA metric to identify the optimal quantile from which to select the responses.
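The RETA definition above can be sketched in a few lines: rank responses by RM score, keep the top-η fraction, and average their oracle scores. This is an illustrative reconstruction from the abstract's definition, not the authors' released code; the function name and inputs are assumptions.

```python
def reta(rm_scores, oracle_scores, eta):
    """Sketch of RETA: average oracle quality of the top-eta quantile
    of responses as ranked by the reward model (illustrative only)."""
    if not 0 < eta <= 1:
        raise ValueError("eta must lie in (0, 1]")
    # Rank response indices by RM score, highest-rewarded first.
    order = sorted(range(len(rm_scores)),
                   key=lambda i: rm_scores[i], reverse=True)
    # Keep the top-eta fraction of responses (at least one).
    k = max(1, int(len(order) * eta))
    top = order[:k]
    # RETA is the mean oracle score over the selected set.
    return sum(oracle_scores[i] for i in top) / k
```

A reliable RM ranks genuinely good responses highly, so its RETA stays high even for small η; an unreliable RM's RETA can peak at an intermediate quantile, which is how the paper suggests choosing where to cut when selecting responses.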
Problem

Research questions and friction points this paper is trying to address.

Lack of metrics to quantify reward model reliability
Uncertain alignment between high-reward outputs and human preferences
Need for cost-effective benchmarking of reward models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes RETA metric for reward model reliability
Introduces benchmarking pipeline for RM evaluation
Uses oracle-scored top quantile response quality
Yizhou Chen
Peking University
AI4SE, Vulnerability Detection, Formal Verification
Yawen Liu
Shopee Pte. Ltd.
Xuesi Wang
Shopee Pte. Ltd.
Qingtao Yu
Shopee Pte. Ltd.
Guangda Huzhang
Shopee Pte. Ltd.
Anxiang Zeng
Nanyang Technological University
Han Yu
Nanyang Technological University
Zhiming Zhou
Shanghai University of Finance and Economics
Generalization, Optimization, GANs, Machine Learning, Computer Graphics