🤖 AI Summary
This study investigates whether large language models (LLMs) possess reliable “factual self-awareness”—i.e., the ability to distinguish factual outputs from hallucinations based on internal representations (hidden states, attention weights, token probabilities).
Method: We propose a framework that classifies hallucinations by their dependence on subject knowledge, and apply mechanistic interpretability techniques to systematically analyze the geometric structure and distributional properties of internal representations during hallucination generation.
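As a concrete illustration of this kind of representation-level analysis (a minimal sketch under stated assumptions, not the paper's actual pipeline), the snippet below extracts last-token hidden states from a small causal LM and fits a linear probe to test whether labelled factual and hallucinated prompts are linearly separable; the model name, the toy prompts, and their labels are all hypothetical placeholders.

```python
# Illustrative sketch (assumptions throughout, not the authors' code):
# extract last-token hidden states for a handful of prompts and fit a
# linear probe to check whether "factual" and "hallucinated" cases are
# linearly separable in representation space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder model; the paper's models are an assumption here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Hypothetical labels: 1 = the model answers factually, 0 = it hallucinates.
# In practice these labels come from checking model generations against ground truth.
examples = [
    ("The capital of France is", 1),
    ("The chemical symbol for gold is", 1),
    ("Water boils at a temperature of", 1),
    ("The capital of the fictional country Wakanda is", 0),
    ("The 51st state of the USA is", 0),
    ("The Nobel Prize won by Sherlock Holmes was for", 0),
]
X = torch.stack([last_token_state(p) for p, _ in examples]).numpy()
y = [label for _, label in examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy of the linear probe:", probe.score(X, y))
```

A real analysis would train and evaluate such probes on held-out splits, per layer and per hallucination type; this toy fit only shows the mechanics of probing hidden states.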
Results: Hallucinations decoupled from subject knowledge are detectable, but knowledge-anchored hallucinations exhibit internal representations nearly indistinguishable from those of correct answers. This indicates that LLMs encode patterns of knowledge retrieval, not truth signals, and that hallucinations stem from representational confusion rather than confidence miscalibration. Our key contribution is the first representation-level demonstration that LLMs lack robust factual metacognition, establishing a fundamental theoretical limit on hallucination detection.
📝 Abstract
Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may "know what they don't know". However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that "LLMs don't really know what they don't know".
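To make the geometric claim concrete, the following is a small synthetic sketch (purely hypothetical data, not results from the paper): it measures how separable each hallucination type is from correct answers using a silhouette score, which stays near zero when two groups of hidden states overlap and grows as they form distinct clusters.

```python
# Hypothetical illustration of the geometric comparison described above:
# hidden states are simulated with NumPy; in practice they would be
# extracted from the model as in the probing sketch.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
dim = 64  # hypothetical hidden-state dimension

correct = rng.normal(loc=0.0, size=(100, dim))
# knowledge-anchored hallucinations: drawn from nearly the same region as correct answers
anchored = rng.normal(loc=0.1, size=(100, dim))
# knowledge-detached hallucinations: shifted into a distinct cluster
detached = rng.normal(loc=2.0, size=(100, dim))

def separability(a: np.ndarray, b: np.ndarray) -> float:
    """Silhouette score of two groups: ~0 means overlapping, closer to 1 means distinct."""
    X = np.vstack([a, b])
    labels = np.array([0] * len(a) + [1] * len(b))
    return silhouette_score(X, labels)

print("correct vs knowledge-anchored:", round(separability(correct, anchored), 3))
print("correct vs knowledge-detached:", round(separability(correct, detached), 3))
```

Under these synthetic assumptions the anchored case scores near zero and the detached case scores clearly above it, mirroring the qualitative pattern the abstract describes.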