🤖 AI Summary
Large language models (LLMs) suffer from miscalibrated confidence estimates, undermining their reliability. To address this, we propose a lightweight calibration method that requires no fine-tuning of the base LLM: it injects targeted adversarial perturbations into the final hidden states, quantifies the resulting representational stability as a proxy for answer correctness, and trains a lightweight binary classifier on these stability features. The approach is architecture-agnostic and applicable to both multiple-choice and open-ended generation tasks. Evaluated on the MMLU and MMLU-Pro benchmarks across 8B–32B parameter models, our method significantly outperforms state-of-the-art baselines: Expected Calibration Error (ECE) decreases by approximately 55%, Brier score drops by 21%, accuracy improves by 5 percentage points, and AUPRC and AUROC increase by 4 and 6 percentage points, respectively. This work establishes hidden-state stability as a novel, efficient, and general-purpose paradigm for confidence estimation in LLMs.
📝 Abstract
Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method that analyzes internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model's response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using the MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
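As a rough illustration of the perturb-and-probe idea (not the authors' implementation), the pipeline can be sketched in plain Python. Everything here is a simplifying assumption: the "answer score" is a dot product rather than a real LLM logit, the perturbations are random Gaussian noise rather than targeted adversarial directions, and the stability feature is a simple sign-flip rate fed to a one-feature logistic classifier.

```python
import math
import random

random.seed(0)

def score(hidden, readout):
    # Stand-in for the model's answer logit: a plain dot product
    # between the final hidden state and a fixed readout direction.
    return sum(h * w for h, w in zip(hidden, readout))

def flip_rate(hidden, readout, eps=0.5, n_probes=200):
    # Perturb the hidden state and count how often the predicted
    # answer (the sign of the score) flips. A low flip rate means
    # the representation is stable, which CCPS uses as evidence
    # that the answer is likely correct. Random noise here stands
    # in for the paper's targeted adversarial perturbations.
    base_sign = score(hidden, readout) >= 0
    flips = 0
    for _ in range(n_probes):
        perturbed = [h + random.gauss(0, eps) for h in hidden]
        if (score(perturbed, readout) >= 0) != base_sign:
            flips += 1
    return flips / n_probes

def fit_logistic_1d(xs, ys, lr=1.0, steps=2000):
    # Tiny one-feature logistic regression, standing in for the
    # "lightweight classifier" that maps stability features to a
    # probability of correctness.
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b
```

In this toy setup, a hidden state far from the decision boundary yields a near-zero flip rate while a borderline one flips often, so the classifier learns to assign high confidence to stable representations and low confidence to unstable ones.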