Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently generate factually incorrect content, yet it remains unclear whether they possess an intrinsic, real-time awareness of factual accuracy during generation. Method: We present empirical evidence of "factual self-awareness" in LLMs, using linear probes on the Transformer's residual stream to identify decodable, intermediate-layer features that predict whether the model will correctly recall the attribute of an entity–relation–attribute triple during autoregressive generation. Contribution/Results: The signal is robust to formatting variations, peaks in middle Transformer layers, and emerges early in training. Combined with contextual perturbation analysis, cross-scale model evaluation, and training-dynamics tracking, these findings identify an intrinsic mechanism relevant to trustworthy generation and provide diagnostic tools for real-time factual monitoring, informing both safety-critical applications and mechanistic interpretability research.
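The probing setup described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's code): the activations are synthetic stand-ins for residual-stream hidden states captured at a middle layer, the planted "self-awareness" direction is an assumption, and the labels mark whether the model recalled the correct attribute.

```python
# Hypothetical sketch of a linear probe on residual-stream activations.
# All data here is synthetic; in the paper's setting, `acts` would be
# hidden states captured while the model generates the attribute token.
import numpy as np

rng = np.random.default_rng(0)

d_model = 512                                  # assumed hidden size
n = 200                                        # number of probed prompts
direction = rng.normal(size=d_model)           # planted "self-awareness" direction
labels = rng.integers(0, 2, size=n)            # 1 = correct recall, 0 = failure
# Activations: isotropic noise plus a signed component along `direction`.
acts = rng.normal(size=(n, d_model)) + np.outer(labels * 2 - 1, direction) * 0.5

# Linear probe: logistic regression trained by plain gradient descent.
w = np.zeros(d_model)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(acts @ w + b)))      # predicted P(correct recall)
    w -= lr * (acts.T @ (p - labels) / n)
    b -= lr * np.mean(p - labels)

probe_acc = np.mean((acts @ w + b > 0) == labels)
print(f"probe accuracy: {probe_acc:.2f}")
```

Because the probe is linear, high accuracy on such features is evidence that correctness of factual recall is linearly decodable from the residual stream, which is the core claim being tested.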

📝 Abstract
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence for an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. We demonstrate that, for a given subject entity and relation, LLMs internally encode linear features in the Transformer's residual stream that dictate whether the model will be able to recall the correct attribute (forming a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics show that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.
Problem

Research questions and friction points this paper is trying to address.

Detecting factual incorrectness in LLM-generated content
Identifying internal features dictating factual recall accuracy
Assessing robustness and scaling of self-awareness in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs internally encode linear factual features
Self-awareness signal resists formatting variations
Self-monitoring peaks in intermediate layers
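The "peaks in intermediate layers" finding implies a simple measurement recipe: fit one probe per layer and locate the depth at which probe accuracy is highest. A toy sketch of that recipe, with synthetic per-layer activations (the rising-then-falling signal strength is an assumption standing in for real model behavior):

```python
# Hypothetical sketch: fit a linear probe at every layer and find where
# the self-awareness signal peaks. Data is synthetic; the depth profile
# `strength` is a toy stand-in for the trend reported in the paper.
import numpy as np

rng = np.random.default_rng(1)
n_layers, d_model, n = 12, 64, 300
labels = rng.integers(0, 2, size=n)            # 1 = correct recall
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

# Assumed signal strength: rises toward the middle layers, then falls.
strength = np.array([min(l, n_layers - 1 - l) for l in range(n_layers)])
strength = strength / (n_layers / 2)

def probe_accuracy(x, y):
    # Closed-form linear probe: project onto the difference of class means,
    # threshold at the midpoint of the projected scores.
    w = x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0)
    scores = x @ w
    return np.mean((scores > scores.mean()) == y)

accs = []
for l in range(n_layers):
    acts = rng.normal(size=(n, d_model)) \
        + np.outer(labels * 2 - 1, direction) * strength[l]
    accs.append(probe_accuracy(acts, labels))

peak = int(np.argmax(accs))
print(f"probe accuracy peaks at layer {peak}")
```

With real models, `acts` would be residual-stream activations cached per layer; the same argmax over per-layer probe accuracies locates the peak.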