🤖 AI Summary
This work addresses the critical challenge of hallucination in large language models—generative outputs lacking factual or contextual grounding—and proposes an efficient internal detection mechanism. It reveals, for the first time, an intrinsic link between hallucinations and attention sinks, arguing that hallucinations arise when the model shifts from input-dependent to prior-dominated computation. Building on this insight, the authors introduce a knowledge-free detection approach that identifies attention sinks via attention maps and constructs a novel detection signal using the norm of value vectors, enabling a lightweight classifier to distinguish hallucinated content. The method achieves state-of-the-art performance across multiple mainstream large language models and benchmark datasets, while also offering a theoretical explanation for the implicit reliance of existing techniques on attention sinks.
📝 Abstract
Large language models frequently exhibit hallucinations: fluent and confident outputs that are factually incorrect or unsupported by the input context. While recent hallucination detection methods have explored various features derived from attention maps, the underlying mechanisms they exploit remain poorly understood. In this work, we propose SinkProbe, a hallucination detection method grounded in the observation that hallucinations are deeply entangled with attention sinks (tokens that accumulate disproportionate attention mass during generation), which signals a transition from distributed, input-grounded attention to compressed, prior-dominated computation. Importantly, although sink scores are computed solely from attention maps, we find that the classifier preferentially relies on sinks whose associated value vectors have large norms. Moreover, we show that previous methods implicitly depend on attention sinks by establishing their mathematical relationship to sink scores. These findings yield a theoretically grounded hallucination detection method that achieves state-of-the-art results across popular datasets and LLMs.
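To make the core quantities concrete, the sketch below computes a per-token "sink score" from an attention map (the mean attention mass each key token receives across heads and query positions) and a value-norm-weighted variant. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `sink_scores`, the averaging scheme, and the simple multiplicative norm weighting are all hypothetical stand-ins for whatever definitions SinkProbe actually uses.

```python
import numpy as np

def sink_scores(attn, value_norms):
    """Illustrative sink scoring (not the paper's exact formulation).

    attn:        (num_heads, seq_len, seq_len) attention map for one layer,
                 where attn[h, q, k] is attention from query q to key k.
    value_norms: (seq_len,) L2 norms of each token's value vectors.

    Returns (received, weighted): the mean attention mass each token
    receives, and that mass weighted by the token's value-vector norm.
    """
    # Average over heads and query positions: attention mass per key token.
    received = attn.mean(axis=(0, 1))            # shape (seq_len,)
    # Hypothetical weighting reflecting the observation that sinks with
    # large value norms matter most to the classifier.
    weighted = received * value_norms
    return received, weighted

# Toy example: make token 0 behave as an attention sink.
rng = np.random.default_rng(0)
num_heads, seq_len, head_dim = 4, 6, 8
attn = rng.random((num_heads, seq_len, seq_len))
attn[:, :, 0] += 5.0                             # pile attention onto token 0
attn /= attn.sum(axis=-1, keepdims=True)         # normalize rows to distributions
value_norms = np.linalg.norm(rng.normal(size=(seq_len, head_dim)), axis=-1)

received, weighted = sink_scores(attn, value_norms)
print(int(received.argmax()))  # token 0 dominates the received attention mass
```

In a real pipeline, scores like these (gathered across layers and generation steps) would form the feature vector fed to the lightweight hallucination classifier the abstract describes.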