How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

📅 2026-04-24

🤖 AI Summary
This study examines the internal mechanisms by which large language models (LLMs) detect and correct their own errors without external feedback. Drawing on second-order confidence theory from decision neuroscience, it tests whether LLMs carry an internal evaluative signal that is partially independent of the generation process: activations at the post-answer newline (PANL) token predict both whether an answer is correct and whether the model can repair it. Using a verify-then-correct paradigm, causal interventions, and activation analyses across models and tasks, the work shows that PANL signals predict error detection beyond token log-probabilities and verbal confidence, and predict error correctability where behavioral signals fail. The findings replicate across Gemma 3 27B and Qwen 2.5 7B on TriviaQA and MNLI, supporting the view that LLMs implement a second-order confidence architecture.

📝 Abstract
Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing the basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at the token immediately following the answer (the post-answer newline, PANL), and that this representation causally drives verbal confidence and dissociates from log-probabilities. Here we test whether this PANL signal extends beyond confidence to support error detection and self-correction, deriving predictions from the second-order framework. Using a verify-then-correct paradigm, we show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct, where all behavioral signals fail. Causal interventions confirm that PANL signals rescue error detection behavior when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results reveal that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but whether the model has the knowledge to fix it.
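
The first-order versus second-order distinction in the abstract can be illustrated with a toy sketch. This is not the paper's method: the logits, the two-dimensional "activation", and the probe weights below are all made-up values chosen for illustration, and `second_order` is a hypothetical linear readout standing in for an evaluative signal such as the one read at PANL.

```python
import math

def softmax(logits):
    # Numerically stable softmax over candidate answers.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def first_order(logits):
    # Choose an answer and read confidence off the same generation signal.
    probs = softmax(logits)
    choice = max(range(len(probs)), key=probs.__getitem__)
    return choice, probs[choice]

def second_order(activation, weights, bias):
    # Hypothetical linear probe on a separate evaluative activation vector;
    # returns an estimated probability that the committed answer is correct.
    score = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))

# Illustrative values only (assumptions, not data from the paper).
logits = [2.0, 0.5, -1.0]
choice, conf = first_order(logits)
probs = softmax(logits)

# First-order: the chosen answer always carries the highest confidence,
# so this signal alone can never favor an alternative over the commitment.
chosen_is_max = all(conf >= p for p in probs)

# Second-order: the evaluative readout is free to disagree with the choice.
activation = [-1.2, 0.4]
weights = [1.5, -0.5]
bias = 0.0
p_correct = second_order(activation, weights, bias)
detects_error = p_correct < 0.5
```

With these values the generation signal stays committed to answer 0, while the separate readout assigns it a low probability of being correct. That disagreement is what a second-order account needs for error detection.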
Problem

Research questions and friction points this paper is trying to address.

large language models
error detection
self-correction
confidence signals
second-order models
Innovation

Methods, ideas, or system contributions that make the work stand out.

second-order confidence
PANL signal
error detection
self-correction
internal evaluative signal