🤖 AI Summary
This work addresses hallucination in large language models (LLMs). We propose a “pre-decoding hallucination detection and hidden-state-guided intervention” paradigm that identifies and corrects likely errors before any token is emitted. Methodologically, we design a lightweight Transformer-based binary classifier that operates on decoder hidden states and learns the mapping between latent representations and factual consistency. We further introduce a gradient-driven hidden-state correction mechanism, packaged as a plug-and-play intervention module compatible with Llama, Mistral, and Gemma architectures. Our core contribution is the first framework for anticipating hallucinations at the hidden-state level, combining low overhead (about 3.16 seconds of added inference time on average) with architectural generality. Experiments show an average pre-decoding detection accuracy above 70% and outputs that are on average 34.4% more factual after intervention.
📝 Abstract
Language models (LMs) hallucinate. We ask: can we detect and mitigate hallucinations before they happen? This work answers the question in the affirmative by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckMate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the hidden states the model produces over its inputs, before decoding begins. If a hallucination is detected, FactCheckMate intervenes by adjusting the LM's hidden states so that the model produces more factual outputs. FactCheckMate offers fresh insight into how the hidden states of LMs reveal their inner workings. Practically, both the detection and mitigation models in FactCheckMate are lightweight and add little inference overhead, making FactCheckMate a more efficient approach to mitigating hallucinations than many post-hoc alternatives. We evaluate FactCheckMate on LMs of different scales and model families (including Llama, Mistral, and Gemma), across a variety of QA datasets from different domains. Our results demonstrate the effectiveness of leveraging internal representations for early hallucination detection and mitigation, with over 70% preemptive detection accuracy. On average, outputs generated by LMs with intervention are 34.4% more factual than those generated without intervention, and FactCheckMate adds around 3.16 seconds of inference time.
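The detect-then-intervene loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not FactCheckMate's actual implementation: the hidden states are synthetic 4-dimensional vectors, the detector is a plain logistic-regression probe standing in for the paper's classifier, and the intervention simply moves a flagged state against the gradient of the probe's logit. The function names (`make_toy_states`, `train_probe`, `predict`, `intervene`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_states(n_per_class, dim=4, seed=0):
    """Synthetic stand-ins for hidden states (assumption, not real LM data).

    Label 1 ("will hallucinate") is centred at +1 on the first coordinate,
    label 0 at -1; all coordinates carry Gaussian noise.
    """
    rng = random.Random(seed)
    states, labels = [], []
    for label, centre in ((1, 1.0), (0, -1.0)):
        for _ in range(n_per_class):
            h = [rng.gauss(0.0, 0.3) for _ in range(dim)]
            h[0] += centre
            states.append(h)
            labels.append(label)
    return states, labels


def predict(w, b, h):
    """Probe's probability that hidden state h precedes a hallucination."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def train_probe(states, labels, lr=0.5, epochs=200):
    """Fit the logistic-regression probe with full-batch gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(states, labels):
            err = predict(w, b, h) - y  # derivative of log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * h[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def intervene(w, b, h, threshold=0.5, step=0.5, max_steps=50):
    """Shift a flagged state until the probe no longer flags it.

    The gradient of the probe's logit with respect to h is w, so each
    step subtracts step * w. Unflagged states are returned unchanged.
    """
    h = list(h)
    for _ in range(max_steps):
        if predict(w, b, h) < threshold:
            break
        h = [hi - step * wi for hi, wi in zip(h, w)]
    return h


if __name__ == "__main__":
    states, labels = make_toy_states(100)
    w, b = train_probe(states, labels)
    flagged = [1.0, 0.0, 0.0, 0.0]
    print("before:", predict(w, b, flagged))
    print("after: ", predict(w, b, intervene(w, b, flagged)))
```

In a real system the adjusted hidden state would be fed back into the LM before decoding begins. This sketch only shows the control flow: score the state, compare against a threshold, and adjust the state when it is flagged.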