🤖 AI Summary
Large language models (LLMs) frequently generate factually incorrect content, yet it remains unclear whether they possess an intrinsic, real-time awareness of factual accuracy during generation. Method: We present the first empirical evidence of "fact self-awareness" in LLMs, using linear probing on Transformer residual streams to identify decodable, intermediate-layer features that predict whether an entity–relation–attribute triple will be recalled correctly during autoregressive generation. Contribution/Results: This capability is robust to formatting variations, peaks in middle Transformer layers, and emerges early in training. Combining contextual perturbation analysis, cross-scale model evaluation, and training-dynamics tracking, we show how these signals can improve the factual controllability and interpretability of LLMs. Our findings reveal an intrinsic mechanism relevant to trustworthy generation and provide diagnostic tools for real-time factual monitoring, advancing both safety-critical applications and mechanistic interpretability research.
📝 Abstract
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence of an internal compass in LLMs that indicates the correctness of factual recall at the time of generation. We demonstrate that, for a given subject entity and relation, LLMs internally encode linear features in the Transformer's residual stream that predict whether the model will recall the correct attribute (the one that completes a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics show that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.
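To make the linear-probing setup concrete, here is a minimal, self-contained sketch. It does not use the paper's models or data: the "residual-stream activations" are synthetic vectors in which an assumed fixed direction encodes whether recall will succeed, and the probe is a simple least-squares linear classifier. All dimensions, sample counts, and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for residual-stream activations at one layer.
# d_model, n_train, n_test are arbitrary choices, not values from the paper.
d_model, n_train, n_test = 64, 500, 200

# Assumed "self-awareness" direction: activations of correct-recall cases
# are shifted along it, incorrect-recall cases in the opposite direction.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def make_batch(n):
    """Synthetic (activation, label) pairs; label 1 = correct recall."""
    labels = rng.integers(0, 2, size=n)
    noise = rng.normal(scale=1.0, size=(n, d_model))
    acts = noise + np.outer(2.0 * labels - 1.0, direction) * 2.0
    return acts, labels

X_train, y_train = make_batch(n_train)
X_test, y_test = make_batch(n_test)

# Linear probe: least-squares fit of one direction plus a bias term,
# regressing activations onto +/-1 targets.
A = np.hstack([X_train, np.ones((n_train, 1))])
w, *_ = np.linalg.lstsq(A, 2.0 * y_train - 1.0, rcond=None)

# Classify held-out activations by the sign of the probe's projection.
pred = (np.hstack([X_test, np.ones((n_test, 1))]) @ w) > 0
accuracy = (pred == y_test).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

In the actual experiments the probe would instead be trained on activations extracted from a specific Transformer layer, with labels obtained by checking whether the model's generated attribute matches the gold triplet; the sketch only shows why a linear probe can read out such a direction when it exists.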