The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether language models internally represent their own correctness when generating false statements. From a geometric perspective, the authors reveal that such representations concentrate in low-dimensional subspaces and manifest as mean shifts. Building on this insight, they propose a few-shot detection method that requires no training of complex probes: predictions are made by comparing the distance of activation vectors to centroids derived from correct and incorrect examples. Evaluated across nine mainstream language models, the approach achieves internal probe AUC scores of 0.80–0.97, substantially outperforming output-confidence-based baselines (0.44–0.64). Remarkably, only 25 labeled samples suffice to reach 90% of full-data performance, and targeted manipulation of activations induces a 10.9 percentage-point change in error rates, demonstrating both the efficacy and controllability of these internal representations.

📝 Abstract
When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.
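The nearest-centroid detection the abstract describes can be sketched in a few lines; the variable names and synthetic data below are illustrative assumptions, not the paper's code, and the mean-shifted Gaussian clouds merely mimic the low-dimensional separation the paper reports.

```python
# Hypothetical sketch: few-shot correctness detection by comparing an
# activation vector's distance to centroids of correct/incorrect examples.
import numpy as np

def centroid_classifier(acts_correct, acts_incorrect):
    """Return a predictor: True if a vector is nearer the 'correct' centroid."""
    mu_c = acts_correct.mean(axis=0)    # centroid of correct examples
    mu_i = acts_incorrect.mean(axis=0)  # centroid of incorrect examples
    def predict(x):
        return np.linalg.norm(x - mu_c) < np.linalg.norm(x - mu_i)
    return predict

# Toy data: two Gaussian clouds separated by a mean shift confined to a
# few dimensions, mimicking the 3-8 dimensional signal the paper reports.
rng = np.random.default_rng(0)
d = 64
shift = np.zeros(d)
shift[:4] = 5.0                             # signal lives in 4 of 64 dims
acts_correct = rng.normal(size=(25, d)) + shift
acts_incorrect = rng.normal(size=(25, d))
predict = centroid_classifier(acts_correct, acts_incorrect)
print(predict(acts_correct[0]), predict(acts_incorrect[0]))
```

Because the classes differ only by a mean shift, this distance rule is equivalent to a linear discriminant, which is consistent with the paper's finding that nonlinear classifiers add nothing over linear separation.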
Problem

Research questions and friction points this paper is trying to address.

correctness representations
language models
confidence detection
geometric structure
internal probes
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence manifold
correctness representation
low-dimensional geometry
activation steering
linear separability
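The activation-steering contribution listed above amounts to shifting hidden states along the learned correctness direction. A minimal numeric sketch, with the direction, scale, and stand-in activations all being illustrative assumptions:

```python
# Minimal sketch of activation steering as a mean shift along a learned
# direction; all values here are toy stand-ins, not the paper's setup.
import numpy as np

def steer(activations, direction, scale):
    """Shift each activation vector by `scale` along the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return activations + scale * unit

acts = np.zeros((3, 8))              # stand-in hidden states (3 tokens, dim 8)
direction = np.eye(8)[0] * 2.0       # e.g. mu_correct - mu_incorrect
steered = steer(acts, direction, scale=4.0)
print(steered[0, :2])                # first component shifted, rest untouched
```

In a real model this shift would be applied to a chosen layer's hidden states during the forward pass (e.g. via a forward hook in PyTorch), and the resulting change in error rate measured, which is how the 10.9 percentage-point effect is validated.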