CHARM: Calibrating Reward Models With Chatbot Arena Scores

📅 2025-04-14
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This paper addresses model preference bias in reward models (RMs)—a systematic tendency to overrate responses from specific policy models, leading to distorted rankings and unfair evaluation of large language models. To mitigate this, we propose CHARM, a calibration framework grounded in Chatbot Arena Elo rankings. Our key contributions are threefold: (1) We formally define and quantify “Mismatch Degree,” a novel metric capturing RM-policy misalignment; (2) We design a lightweight, annotation-free, end-to-end calibration pipeline integrating Elo transfer, preference-data fine-tuning, and continual reward modeling; (3) Empirical evaluation on RM-Bench and RewardBench (Chat-Hard) shows that calibrated RMs achieve significantly higher accuracy, stronger correlation between reward scores and human Elo rankings, and a 37% reduction in model preference bias—thereby enhancing fairness in cross-model comparison and fidelity in human preference modeling.
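The page does not reproduce the paper's exact Mismatch Degree formula. The sketch below shows one plausible instantiation, treating RM-policy misalignment as rank disagreement between the model ordering induced by mean RM scores and the Chatbot Arena Elo ordering; the function name `mismatch_degree` and the Spearman-based form are illustrative assumptions, not the paper's published definition.

```python
import numpy as np
from scipy.stats import spearmanr

def mismatch_degree(rm_scores: dict[str, float], elo_scores: dict[str, float]) -> float:
    """Illustrative misalignment measure (NOT the paper's exact formula):
    half of (1 - Spearman correlation) between the model ranking induced by
    mean RM scores and the ranking given by Chatbot Arena Elo."""
    models = sorted(rm_scores)                       # fix a shared model order
    rm = np.array([rm_scores[m] for m in models])
    elo = np.array([elo_scores[m] for m in models])
    rho, _ = spearmanr(rm, elo)
    return (1.0 - rho) / 2.0                         # 0 = aligned, 1 = fully reversed

# Example: an RM that overrates model_b relative to its Elo standing.
rm = {"model_a": 2.1, "model_b": 3.4, "model_c": 1.0}
elo = {"model_a": 1250.0, "model_b": 1180.0, "model_c": 1100.0}
print(mismatch_degree(rm, elo))                      # > 0 signals preference bias
```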

📝 Abstract
Reward models (RMs) play a crucial role in Reinforcement Learning from Human Feedback by serving as proxies for human preferences in aligning large language models. In this paper, we identify a model preference bias in RMs, where they systematically assign disproportionately high scores to responses from certain policy models. This bias distorts ranking evaluations and leads to unfair judgments. To address this issue, we propose a calibration method named CHatbot Arena calibrated Reward Modeling (CHARM) that leverages Elo scores from the Chatbot Arena leaderboard to mitigate RM overvaluation. We also introduce a Mismatch Degree metric to measure this preference bias. Our approach is computationally efficient, requiring only a small preference dataset for continued training of the RM. We conduct extensive experiments on reward model benchmarks and human preference alignment. Results demonstrate that our calibrated RMs (1) achieve improved evaluation accuracy on RM-Bench and the Chat-Hard domain of RewardBench, and (2) exhibit a stronger correlation with human preferences by producing scores more closely aligned with Elo rankings. By mitigating model preference bias, our method provides a generalizable and efficient solution for building fairer and more reliable reward models.
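The abstract's Elo-based calibration can be pictured as converting Arena Elo gaps into expected win rates and training the RM against those soft targets. A minimal sketch follows, assuming the standard Elo/Bradley-Terry conversion (scale 400, as used by Chatbot Arena) and a soft binary cross-entropy objective; CHARM's actual pipeline may differ, and the names here are illustrative.

```python
import torch
import torch.nn.functional as F

def elo_win_prob(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected win rate of model A over model B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))

def calibration_loss(r_a: torch.Tensor, r_b: torch.Tensor,
                     p_target: torch.Tensor) -> torch.Tensor:
    """Soft Bradley-Terry objective: push sigmoid(r_a - r_b) toward the
    Elo-derived win probability rather than a hard 0/1 preference label."""
    return F.binary_cross_entropy_with_logits(r_a - r_b, p_target)

# Two policy models 70 Elo points apart on the Arena leaderboard.
p = torch.tensor([elo_win_prob(1250.0, 1180.0)])     # ~0.60
r_a, r_b = torch.tensor([1.3]), torch.tensor([0.9])  # RM scores for a response pair
print(calibration_loss(r_a, r_b, p))
```

Using a soft target rather than a hard label lets the training signal encode how much better one policy model is, not just which one wins, which is what ties the RM's score scale to the human Elo ranking.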
Problem

Research questions and friction points this paper is trying to address.

Addresses model preference bias, where RMs overrate responses from certain policy models
Proposes calibration using Elo scores to improve fairness
Enhances human preference alignment in reward models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Calibrates reward models using Chatbot Arena Elo scores
Introduces Mismatch Degree metric for bias measurement
Requires only a small preference dataset for efficient continued training (sketched below)
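To make the "small dataset, continued training" point concrete, here is a minimal training-loop sketch with a toy reward head over pre-computed embeddings. All names, shapes, and hyperparameters are illustrative; a real run would fine-tune an actual RM on Elo-calibrated preference data rather than this stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Stand-in for a real RM head: scores a pre-computed response embedding."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)

rm = ToyRewardModel()
opt = torch.optim.AdamW(rm.parameters(), lr=1e-4)

# One small batch of paired responses with an Elo-derived soft label
# (e.g., produced by elo_win_prob in the sketch above).
emb_a, emb_b = torch.randn(8, 16), torch.randn(8, 16)
p_target = torch.full((8,), 0.60)

for _ in range(3):  # a few continued-training steps on the small dataset
    loss = F.binary_cross_entropy_with_logits(rm(emb_a) - rm(emb_b), p_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```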
👥 Authors

Xiao Zhu
HKUST (Guangzhou)

Chenmien Tan
Alibaba Group

Pinzhen Chen
University of Edinburgh
large language models · LLM post-training · machine translation · multilinguality

Rico Sennrich
University of Zurich

Yanlin Zhang
HKUST (Guangzhou)

Hanxu Hu
University of Zurich, University of Edinburgh
Large Language Models · Machine Learning