🤖 AI Summary
This work investigates a mechanism behind hallucination in large language models (LLMs): whether models internally represent what they do and do not know about entities. Method: Using sparse autoencoders, the authors identify sparse directions in the representation space that encode entity recognition, then probe their functional role through causal steering interventions. Contribution/Results: These directions provide evidence that LLMs carry a form of self-knowledge, i.e. internal representations of their own capabilities. Although the sparse autoencoders are trained on the base model, the directions causally affect the chat model's refusal behavior, suggesting that chat finetuning repurposes this existing mechanism; steering along them can induce refusal to answer about known entities, or hallucinated attributes for unknown entities. Mechanistically, the directions disrupt the attention of downstream heads that normally move entity attributes to the final token, pointing toward interpretable, intervention-based control of hallucination.
📝 Abstract
Hallucinations in large language models are a widespread problem, yet the mechanisms that determine whether a model will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects whether an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space; these detect whether the model recognizes an entity, e.g. detecting that it doesn't know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration of the mechanistic role of these directions, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
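The steering intervention described in the abstract can be sketched in PyTorch. This is a minimal, illustrative example, not the paper's actual code: the `TinyBlock` module, the random direction, and the scaling factor `alpha` are all stand-in assumptions. The idea is simply to add a fixed unit-norm direction (e.g. an SAE-derived "entity recognition" direction) to a block's residual-stream output via a forward hook, and check that the steered activations project more strongly onto that direction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 16  # illustrative hidden size

# Hypothetical SAE-derived direction, normalized to unit length.
direction = torch.randn(d_model)
direction = direction / direction.norm()

class TinyBlock(nn.Module):
    """Stand-in for one transformer block with a residual connection."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.linear(x)

block = TinyBlock()

def steering_hook(module, inputs, output, alpha=5.0):
    # Returning a value from a forward hook replaces the module's output:
    # here we shift every token position along the steering direction.
    return output + alpha * direction

handle = block.register_forward_hook(steering_hook)

x = torch.zeros(1, 4, d_model)  # (batch, seq, d_model)
steered = block(x)
handle.remove()                 # detach the hook for the baseline pass
unsteered = block(x)

# Mean projection of the residual stream onto the steering direction.
proj_steered = (steered @ direction).mean()
proj_unsteered = (unsteered @ direction).mean()
```

Since the hook adds `alpha * direction` at every position, the steered projection exceeds the unsteered one by exactly `alpha`. In the real setting, the same hook pattern would be attached to a chosen layer of the language model, with the direction taken from a trained sparse autoencoder's decoder weights.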