Closing the Confidence-Faithfulness Gap in Large Language Models

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the frequent misalignment between large language models' stated confidence and their actual accuracy, a manifestation of poor calibration. Through mechanistic interpretability analysis, we discover that internal calibration signals and verbalized confidence signals are approximately orthogonal in the model's linear representation space. We further identify a novel "reasoning contamination effect," wherein the reasoning process interferes with faithful confidence expression. To mitigate this issue, we propose a two-stage adaptive steering framework that integrates linear probing with Contrastive Activation Addition (CAA). Evaluated across three open-weight large language models and four benchmark datasets, our approach significantly improves the consistency between expressed confidence and factual accuracy, effectively narrowing the confidence-faithfulness gap.
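The first stage described above, reading an internal accuracy estimate via linear probing, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synthetic activations, dimensions, and the assumed "internal accuracy direction" are all hypothetical stand-ins for real hidden states labeled by answer correctness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: hidden activations (n x d_model) from one layer of an
# LLM, labeled 1 if the model answered correctly, 0 otherwise. Here the
# "internal accuracy" signal is planted along a random direction.
d_model = 64
n = 400
true_dir = rng.normal(size=d_model)            # assumed internal accuracy direction
acts = rng.normal(size=(n, d_model))
labels = (acts @ true_dir + 0.5 * rng.normal(size=n) > 0).astype(float)

# Linear probe: logistic regression trained by plain gradient descent.
w = np.zeros(d_model)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(correct)
    w -= lr * (acts.T @ (p - labels)) / n      # gradient of logistic loss
    b -= lr * float(np.mean(p - labels))

preds = acts @ w + b > 0
accuracy = float(np.mean(preds == labels))
# The learned weight vector w approximates the probe's calibration direction,
# which the second stage would then use as a steering target.
```

If the probe separates correct from incorrect answers well above chance, its weight vector gives a linear readout of the model's internal accuracy estimate.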

📝 Abstract
Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers the verbalized output to match it, substantially improving calibration alignment across all evaluated models.
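The CAA steering step referenced in the abstract can be sketched as below. This is a schematic illustration under assumed data, not the paper's code: the contrast pairs, the planted confidence direction, and the `apply_steering` helper are hypothetical, and in a real model the intervention would be applied to a layer's residual-stream activations during the forward pass.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical activations at one layer for contrastive prompt pairs:
# "high verbalized confidence" vs "low verbalized confidence" completions.
d_model = 32
n_pairs = 200
base = rng.normal(size=(n_pairs, d_model))     # shared prompt content
conf_dir = np.zeros(d_model)
conf_dir[0] = 1.0                              # assumed confidence direction
acts_high = base + 2.0 * conf_dir
acts_low = base - 2.0 * conf_dir

# CAA: the steering vector is the mean difference of contrastive activations;
# shared content cancels, leaving the behavior-specific direction.
steer = acts_high.mean(axis=0) - acts_low.mean(axis=0)

def apply_steering(hidden, alpha=1.0):
    """Add the scaled steering vector to a hidden state (the CAA intervention)."""
    return hidden + alpha * steer

h = rng.normal(size=d_model)
h_steered = apply_steering(h, alpha=0.5)
```

In the two-stage pipeline, the probe's internal accuracy readout would choose the sign and magnitude of `alpha`, pushing verbalized confidence up or down to match the internal estimate.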
Problem

Research questions and friction points this paper is trying to address.

confidence-faithfulness gap
large language models
calibration
verbalized confidence
reasoning contamination
Innovation

Methods, ideas, or system contributions that make the work stand out.

mechanistic interpretability
confidence calibration
contrastive activation addition
reasoning contamination effect
adaptive steering