🤖 AI Summary
Large language models (LLMs) suffer from miscalibrated confidence estimates, undermining their reliability. To address this, we propose a lightweight calibration method that requires no fine-tuning of the base LLM: it injects targeted adversarial perturbations into the final hidden states, quantifies the resulting representational stability as a proxy for answer correctness, and trains a lightweight binary classifier on these stability features. The approach is architecture-agnostic and applicable to both multiple-choice and open-ended generation tasks. Evaluated on the MMLU and MMLU-Pro benchmarks across 8B–32B parameter models, our method significantly outperforms state-of-the-art baselines: Expected Calibration Error (ECE) decreases by approximately 55%, Brier score drops by 21%, accuracy improves by 5 percentage points, and AUPRC and AUROC increase by 4 and 6 percentage points, respectively. This work establishes hidden-state stability as a novel, efficient, and general-purpose paradigm for confidence estimation in LLMs.
📝 Abstract
Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method that analyzes internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model's response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using the MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
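As a rough illustration of the perturb-and-probe idea (not the authors' implementation), the pipeline can be sketched in plain Python. Everything here is a simplifying assumption: the "answer score" is a dot product rather than a real LLM logit, the perturbations are random Gaussian noise rather than targeted adversarial directions, and the stability feature is a simple sign-flip rate fed to a one-feature logistic classifier.

```python
import math
import random

random.seed(0)

def score(hidden, readout):
    # Stand-in for the model's answer logit: a plain dot product
    # between the final hidden state and a fixed readout direction.
    return sum(h * w for h, w in zip(hidden, readout))

def flip_rate(hidden, readout, eps=0.5, n_probes=200):
    # Perturb the hidden state and count how often the predicted
    # answer (the sign of the score) flips. A low flip rate means
    # the representation is stable, which CCPS uses as evidence
    # that the answer is likely correct. Random noise here stands
    # in for the paper's targeted adversarial perturbations.
    base_sign = score(hidden, readout) >= 0
    flips = 0
    for _ in range(n_probes):
        perturbed = [h + random.gauss(0, eps) for h in hidden]
        if (score(perturbed, readout) >= 0) != base_sign:
            flips += 1
    return flips / n_probes

def fit_logistic_1d(xs, ys, lr=1.0, steps=2000):
    # Tiny one-feature logistic regression, standing in for the
    # "lightweight classifier" that maps stability features to a
    # probability of correctness.
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b
```

In this toy setup, a hidden state far from the decision boundary yields a near-zero flip rate while a borderline one flips often, so the classifier learns to assign high confidence to stable representations and low confidence to unstable ones.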