🤖 AI Summary
Large language models (LLMs) frequently exhibit miscalibration (over- or under-confidence) in factual question answering, undermining their reliable deployment. To address this, we propose an unsupervised reinforcement learning (RL) framework for confidence calibration: we formalize calibration as a betting game and design a theoretically grounded dual-penalty reward function that jointly penalizes over- and under-confidence based on optimal calibration theory. Using the PPO algorithm, we jointly optimize answer generation and confidence score prediction. Crucially, our method is the first to explicitly embed theoretical optimality conditions for calibration into the RL reward; it requires no human-annotated confidence labels, enables model-intrinsic calibration, and supports zero-shot cross-task generalization. Experiments across multiple benchmarks demonstrate a 42% average reduction in Expected Calibration Error (ECE); notably, strong calibration persists even on unseen tasks, confirming that intrinsic confidence awareness in LLMs is trainable.
📝 Abstract
Safe and trustworthy use of Large Language Models (LLMs) requires that they accurately express confidence in their answers. We introduce a novel Reinforcement Learning (RL) approach for LLM calibration that fine-tunes LLMs to elicit calibrated confidence estimates for their answers to factual questions. We model the problem as a betting game in which the model predicts a confidence score together with every answer, and design a reward function that penalizes both over- and under-confidence. We prove that, under our reward design, an optimal policy yields perfectly calibrated confidence estimates. Our experiments demonstrate significantly improved confidence calibration and generalization to new tasks without re-training, indicating that our approach teaches a general confidence awareness. This approach enables the training of inherently calibrated LLMs.
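The exact dual-penalty reward is not reproduced here; as a minimal illustrative sketch, a Brier-style proper scoring rule has the same key property claimed in the abstract: a single reward that penalizes both over- and under-confidence, with expected reward maximized exactly when the stated confidence equals the true probability of being correct. The function names below are hypothetical, not from the paper.

```python
def calibration_reward(correct: bool, confidence: float) -> float:
    """Brier-style proper scoring rule (illustrative stand-in for the
    paper's dual-penalty reward). High confidence on a wrong answer is
    penalized (over-confidence), as is low confidence on a correct
    answer (under-confidence)."""
    outcome = 1.0 if correct else 0.0
    return 1.0 - (confidence - outcome) ** 2

def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is correct with probability
    p_correct. Differentiating shows the maximum is attained at
    confidence == p_correct, i.e. perfect calibration."""
    return (p_correct * calibration_reward(True, confidence)
            + (1.0 - p_correct) * calibration_reward(False, confidence))
```

For example, if the model answers correctly 70% of the time, reporting confidence 0.7 beats both an over-confident 0.9 and an under-confident 0.5 in expectation, which is the optimality condition the abstract's proof refers to.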