🤖 AI Summary
Quantifying the reliability of reward models (RMs) remains challenging due to the lack of computable, annotation-free evaluation metrics.
Method: We propose RETA—the first tractable, human-annotation-free reliability metric—defined as the average quality (under an oracle) of the top-η quantile of responses selected by an RM. We develop an end-to-end benchmarking pipeline enabling zero-cost evaluation of arbitrary RMs, incorporating quantile-based response filtering, oracle-based re-labeling, and statistical stability analysis, with adaptive selection of the optimal η.
Contribution/Results: Empirically validated on RLHF and rejection sampling pipelines, RETA demonstrates strong stability and discriminative power across diverse public and proprietary RMs. It significantly improves response selection fidelity and alignment with human preferences—particularly for unreliable RMs—thereby establishing a trustworthy, practical foundation for assessing RM reliability in LLM alignment.
📝 Abstract
The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is no convincing metric to quantify the reliability of RMs. To bridge this gap, we propose the *Reliable at $\eta$* (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top $\eta$ quantile of responses as ranked by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, we can use the RETA metric to identify the optimal quantile from which to select responses.
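The RETA definition above can be sketched in a few lines: rank responses by RM score, keep the top $\eta$ quantile, and average their oracle-assigned quality. This is a minimal illustrative sketch, not the paper's implementation; the function name and arguments are assumptions.

```python
import numpy as np

def reta(rm_scores, oracle_scores, eta):
    """Illustrative RETA sketch: mean oracle quality of the top-eta
    quantile of responses, ranked by the reward model's scores."""
    rm_scores = np.asarray(rm_scores, dtype=float)
    oracle_scores = np.asarray(oracle_scores, dtype=float)
    # Number of responses in the top-eta quantile (at least one).
    k = max(1, int(np.ceil(eta * len(rm_scores))))
    # Indices of the k responses the RM ranks highest.
    top = np.argsort(rm_scores)[::-1][:k]
    # RETA: average oracle quality over that selected subset.
    return float(oracle_scores[top].mean())
```

A perfectly reliable RM ranks exactly the oracle-preferred responses on top, so its RETA at small $\eta$ approaches the oracle's maximum; an unreliable RM mixes poor responses into the top quantile and its RETA drops accordingly.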