🤖 AI Summary
Large language models (LLMs) frequently exhibit miscalibration (over- or under-confidence) in factual question answering, undermining their reliable deployment. To address this, we propose an unsupervised reinforcement learning (RL) framework for confidence calibration: we formalize calibration as a betting game and design a theoretically grounded dual-penalty reward function that jointly penalizes over- and under-confidence based on optimal calibration theory. Using the PPO algorithm, we jointly optimize answer generation and confidence score prediction. Crucially, our method is the first to explicitly embed theoretical optimality conditions for calibration into the RL reward; it requires no human-annotated confidence labels, enables model-intrinsic calibration, and supports zero-shot cross-task generalization. Experiments across multiple benchmarks demonstrate a 42% average reduction in Expected Calibration Error (ECE); notably, strong calibration persists even on unseen tasks, confirming that intrinsic confidence awareness in LLMs is trainable.
📝 Abstract
Safe and trustworthy use of Large Language Models (LLMs) requires that they accurately express confidence in their answers. We introduce a novel Reinforcement Learning (RL) approach for LLM calibration that fine-tunes LLMs to elicit calibrated confidence estimates for their answers to factual questions. We model the problem as a betting game in which the model predicts a confidence score together with every answer, and design a reward function that penalizes both over- and under-confidence. We prove that, under our reward design, an optimal policy yields perfectly calibrated confidence estimates. Our experiments demonstrate significantly improved confidence calibration and generalization to new tasks without re-training, indicating that our approach teaches a general confidence awareness. This approach enables the training of inherently calibrated LLMs.
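The exact dual-penalty reward is not reproduced here; as a minimal illustrative sketch, a Brier-style proper scoring rule has the same key property claimed in the abstract: a single reward that penalizes both over- and under-confidence, with expected reward maximized exactly when the stated confidence equals the true probability of being correct. The function names below are hypothetical, not from the paper.

```python
def calibration_reward(correct: bool, confidence: float) -> float:
    """Brier-style proper scoring rule (illustrative stand-in for the
    paper's dual-penalty reward). High confidence on a wrong answer is
    penalized (over-confidence), as is low confidence on a correct
    answer (under-confidence)."""
    outcome = 1.0 if correct else 0.0
    return 1.0 - (confidence - outcome) ** 2

def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is correct with probability
    p_correct. Differentiating shows the maximum is attained at
    confidence == p_correct, i.e. perfect calibration."""
    return (p_correct * calibration_reward(True, confidence)
            + (1.0 - p_correct) * calibration_reward(False, confidence))
```

For example, if the model answers correctly 70% of the time, reporting confidence 0.7 beats both an over-confident 0.9 and an under-confident 0.5 in expectation, which is the optimality condition the abstract's proof refers to.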