The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether language models internally represent their own correctness when generating false statements. From a geometric perspective, the authors reveal that such representations concentrate in low-dimensional subspaces and manifest as mean shifts. Building on this insight, they propose a few-shot detection method that requires no training of complex probes: predictions are made by comparing the distance of activation vectors to centroids derived from correct and incorrect examples. Evaluated across nine mainstream language models, the approach achieves internal probe AUC scores of 0.80–0.97, substantially outperforming output-confidence-based baselines (0.44–0.64). Remarkably, only 25 labeled samples suffice to reach 90% of full-data performance, and targeted manipulation of activations induces a 10.9 percentage-point change in error rates, demonstrating both the efficacy and controllability of these internal representations.

📝 Abstract
When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.
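The nearest-centroid detection the abstract describes can be sketched in a few lines; the variable names and synthetic data below are illustrative assumptions, not the paper's code, and the mean-shifted Gaussian clouds merely mimic the low-dimensional separation the paper reports.

```python
# Hypothetical sketch: few-shot correctness detection by comparing an
# activation vector's distance to centroids of correct/incorrect examples.
import numpy as np

def centroid_classifier(acts_correct, acts_incorrect):
    """Return a predictor: True if a vector is nearer the 'correct' centroid."""
    mu_c = acts_correct.mean(axis=0)    # centroid of correct examples
    mu_i = acts_incorrect.mean(axis=0)  # centroid of incorrect examples
    def predict(x):
        return np.linalg.norm(x - mu_c) < np.linalg.norm(x - mu_i)
    return predict

# Toy data: two Gaussian clouds separated by a mean shift confined to a
# few dimensions, mimicking the 3-8 dimensional signal the paper reports.
rng = np.random.default_rng(0)
d = 64
shift = np.zeros(d)
shift[:4] = 5.0                             # signal lives in 4 of 64 dims
acts_correct = rng.normal(size=(25, d)) + shift
acts_incorrect = rng.normal(size=(25, d))
predict = centroid_classifier(acts_correct, acts_incorrect)
print(predict(acts_correct[0]), predict(acts_incorrect[0]))
```

Because the classes differ only by a mean shift, this distance rule is equivalent to a linear discriminant, which is consistent with the paper's finding that nonlinear classifiers add nothing over linear separation.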
Problem

Research questions and friction points this paper is trying to address.

correctness representations
language models
confidence detection
geometric structure
internal probes
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence manifold
correctness representation
low-dimensional geometry
activation steering
linear separability
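The activation-steering contribution listed above amounts to shifting hidden states along the learned correctness direction. A minimal numeric sketch, with the direction, scale, and stand-in activations all being illustrative assumptions:

```python
# Minimal sketch of activation steering as a mean shift along a learned
# direction; all values here are toy stand-ins, not the paper's setup.
import numpy as np

def steer(activations, direction, scale):
    """Shift each activation vector by `scale` along the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return activations + scale * unit

acts = np.zeros((3, 8))              # stand-in hidden states (3 tokens, dim 8)
direction = np.eye(8)[0] * 2.0       # e.g. mu_correct - mu_incorrect
steered = steer(acts, direction, scale=4.0)
print(steered[0, :2])                # first component shifted, rest untouched
```

In a real model this shift would be applied to a chosen layer's hidden states during the forward pass (e.g. via a forward hook in PyTorch), and the resulting change in error rate measured, which is how the 10.9 percentage-point effect is validated.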