🤖 AI Summary
This study investigates whether the reasoning that large language models produce during chain-of-thought inference faithfully reflects their internal computation, focusing on their capacity to detect their own errors. Using linear probing, activation interventions (activation steering and patching), self-correction, and probe-guided best-of-N sampling, the work provides the first quantitative evidence that hidden model states predict reasoning correctness with high accuracy (AUROC of 0.95), substantially outperforming a surface-level text classifier (AUROC of 0.59). Despite this strong diagnostic signal, none of the intervention strategies succeeds in using it to correct errors, indicating that the signal is diagnostic rather than causal. These findings show that models carry a latent awareness of their own mistakes that cannot readily be exploited, which marks a boundary for mechanistic interpretability in language models.
📝 Abstract
Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC, and already reaches 0.79 at the very first reasoning step, while verbalized confidence for wrong traces is 4.55/5, nearly identical to that for correct ones (4.87/5). A text-surface classifier achieves only 0.59 AUROC on the same data, confirming a 0.36-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), model sizes from 1.5B to 72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions -- activation steering, probe-guided best-of-N, self-correction, and activation patching -- all fail; patching destroys output coherence entirely. The signal is diagnostic, not causal: a readout of computation quality, not a lever to redirect it. This delineates a boundary for mechanistic interpretability: error representations during reasoning are fundamentally different from the factual knowledge representations that prior work has successfully edited.
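
To make the probing setup concrete, the following is a minimal sketch of a linear probe evaluated by AUROC. It is not the authors' pipeline: the "hidden states" are synthetic Gaussian vectors in which correctness is encoded along one random direction, and every name here (`make_synthetic_hidden_states`, `fit_linear_probe`, `auroc`, the dimensions and hyperparameters) is a hypothetical choice made for illustration. With real models, `X` would presumably hold activations extracted at a chosen layer and reasoning step, and `y` the correctness label of each trace.

```python
import numpy as np


def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


def fit_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic-regression probe trained with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(correct)
        grad = p - y                             # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b


def make_synthetic_hidden_states(n=2000, dim=64, signal=1.0, seed=0):
    """ASSUMPTION: stand-in for real activations. Correct traces (y=1) are
    shifted by +signal along a random unit direction, wrong ones by -signal."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    X = rng.normal(size=(n, dim)) + signal * (2 * y[:, None] - 1) * direction
    return X, y.astype(float)


X, y = make_synthetic_hidden_states()
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

w, b = fit_linear_probe(X_train, y_train)
probe_auroc = auroc(y_test, X_test @ w + b)  # score held-out traces only
print(f"held-out probe AUROC: {probe_auroc:.3f}")
```

The same `auroc` function could score a text-surface classifier on the same held-out traces, which is how a probe-versus-text gap like the one reported above would be measured. Note that a high probe AUROC only shows that correctness is linearly decodable from the activations; it says nothing about whether moving activations along `w` changes the model's output, which is the distinction the abstract draws between a diagnostic and a causal signal.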