🤖 AI Summary
Large language models (LLMs) frequently generate factually incorrect content, yet it remains unclear whether they possess an intrinsic, real-time awareness of factual accuracy during generation. Method: We present the first empirical evidence of "fact self-awareness" in LLMs, using linear probing on Transformer residual streams to identify decodable, intermediate-layer features that predict whether an entity–relation–attribute triple will be recalled correctly during autoregressive generation. Contribution/Results: This capability is robust to formatting variations, peaks in middle Transformer layers, and emerges early in training. Combining contextual perturbation analysis, cross-scale model evaluation, and training-dynamics tracking, we show how these signals can improve the factual controllability and interpretability of LLMs. Our findings reveal an intrinsic mechanism relevant to trustworthy generation and provide diagnostic tools for real-time factual monitoring, advancing both safety-critical applications and mechanistic interpretability research.
📝 Abstract
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence of an internal compass in LLMs that indicates the correctness of factual recall at the time of generation. We demonstrate that, for a given subject entity and relation, LLMs internally encode linear features in the Transformer's residual stream that predict whether the model will recall the correct attribute (the one that completes a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics show that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.
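To make the linear-probing setup concrete, here is a minimal, self-contained sketch. It does not use the paper's models or data: the "residual-stream activations" are synthetic vectors in which an assumed fixed direction encodes whether recall will succeed, and the probe is a simple least-squares linear classifier. All dimensions, sample counts, and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for residual-stream activations at one layer.
# d_model, n_train, n_test are arbitrary choices, not values from the paper.
d_model, n_train, n_test = 64, 500, 200

# Assumed "self-awareness" direction: activations of correct-recall cases
# are shifted along it, incorrect-recall cases in the opposite direction.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def make_batch(n):
    """Synthetic (activation, label) pairs; label 1 = correct recall."""
    labels = rng.integers(0, 2, size=n)
    noise = rng.normal(scale=1.0, size=(n, d_model))
    acts = noise + np.outer(2.0 * labels - 1.0, direction) * 2.0
    return acts, labels

X_train, y_train = make_batch(n_train)
X_test, y_test = make_batch(n_test)

# Linear probe: least-squares fit of one direction plus a bias term,
# regressing activations onto +/-1 targets.
A = np.hstack([X_train, np.ones((n_train, 1))])
w, *_ = np.linalg.lstsq(A, 2.0 * y_train - 1.0, rcond=None)

# Classify held-out activations by the sign of the probe's projection.
pred = (np.hstack([X_test, np.ones((n_test, 1))]) @ w) > 0
accuracy = (pred == y_test).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

In the actual experiments the probe would instead be trained on activations extracted from a specific Transformer layer, with labels obtained by checking whether the model's generated attribute matches the gold triplet; the sketch only shows why a linear probe can read out such a direction when it exists.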