Taming Overconfidence in LLMs: Reward Calibration in RLHF

📅 2024-10-13
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a root cause of "verbalized overconfidence" in RLHF-trained large language models (LLMs): a systematic reward-model bias toward high-confidence responses, which creates a mismatch between stated confidence and actual response quality. To address this, the authors propose two annotation-free PPO variants: PPO-M, which incorporates explicit confidence scores into reward model training so the reward model better aligns verbalized confidence with response quality, and PPO-C, which adjusts the reward during PPO based on the gap between the current reward and an exponential moving average of past rewards. Both integrate seamlessly into standard RLHF pipelines. Experiments on Llama3-8B and Mistral-7B show that the methods reduce expected calibration error (ECE) by up to 37%, match standard PPO's accuracy across tasks, and preserve open-ended dialogue capability, supporting both effectiveness and generalizability.

📝 Abstract
Language model calibration refers to the alignment between the confidence of the model and the actual performance of its responses. Previous studies have pointed out the overconfidence phenomenon in Large Language Models (LLMs) and shown that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) are overconfident, with sharper output probabilities. In this study, we reveal that RLHF also tends to lead models to express verbalized overconfidence in their own responses. We investigate the underlying cause of this overconfidence and demonstrate that reward models used for Proximal Policy Optimization (PPO) exhibit inherent biases towards high-confidence scores regardless of the actual quality of responses. Building upon this insight, we propose two PPO variants: PPO-M (PPO with Calibrated Reward Modeling) and PPO-C (PPO with Calibrated Reward Calculation). PPO-M integrates explicit confidence scores into reward model training, which calibrates reward models to better capture the alignment between response quality and verbalized confidence. PPO-C adjusts the reward score during PPO based on the difference between the current reward and the exponential moving average of past rewards. Both PPO-M and PPO-C can be seamlessly integrated into the current PPO pipeline and do not require additional golden labels. We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets, including multiple-choice and open-ended generation. Experimental results demonstrate that both methods reduce calibration error while maintaining performance comparable to standard PPO. We further show that they preserve model capabilities in open-ended conversational settings.
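The abstract's description of PPO-C (adjusting the reward by the difference between the current reward and an exponential moving average of past rewards, conditioned on whether the response verbalizes high confidence) can be illustrated with a minimal sketch. This is not the authors' code; the class name, hyperparameters, and the sign convention for the adjustment are all assumptions made for illustration.

```python
# Hypothetical sketch of a PPO-C-style reward adjustment (not the paper's code).
# Idea: compare the current reward to an exponential moving average (EMA) of
# past rewards; for high-confidence responses, amplify the reward when quality
# is above average and penalize it when below average, and invert the
# adjustment for low-confidence responses.

class EMARewardCalibrator:
    def __init__(self, decay: float = 0.9, scale: float = 0.5):
        self.decay = decay   # EMA decay factor (assumed hyperparameter)
        self.scale = scale   # strength of the confidence-based adjustment
        self.ema = None      # running average of raw rewards

    def calibrate(self, reward: float, high_confidence: bool) -> float:
        # Update the EMA of past rewards (initialize on first call).
        if self.ema is None:
            self.ema = reward
        else:
            self.ema = self.decay * self.ema + (1 - self.decay) * reward
        delta = reward - self.ema
        # High confidence: push reward away from the running mean in the
        # direction of the quality gap; low confidence: push the other way.
        adjustment = self.scale * delta if high_confidence else -self.scale * delta
        return reward + adjustment
```

Note this requires no gold labels: the running average is computed entirely from rewards already produced during PPO, which is consistent with the abstract's claim that the method is annotation-free.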
Problem

Research questions and friction points this paper is trying to address.

LLMs trained with RLHF express verbalized overconfidence in their own responses
Reward models used for PPO are biased toward high-confidence scores regardless of response quality
How to calibrate reward modeling and reward calculation without additional annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

PPO-M integrates explicit confidence scores into reward model training.
PPO-C adjusts reward scores using an exponential moving average of past rewards.
Both methods reduce calibration error without requiring extra labels.
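The bullets above cite reduced calibration error; the paper reports this as expected calibration error (ECE). As background, the standard ECE computation (bin predictions by confidence, then average the per-bin gap between accuracy and mean confidence, weighted by bin size) can be sketched as follows. This is the textbook definition, not code from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then sum the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; clamp 1.0 into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece
```

A perfectly calibrated model (e.g. 90% of its 0.9-confidence answers are correct) scores an ECE of 0; an overconfident model, such as the RLHF-trained models the paper studies, scores higher.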