🤖 AI Summary
Large language models are prone to generating hard-to-detect hallucinations in reasoning tasks, which undermines their safety and reliability. This work reframes hallucination detection as an out-of-distribution (OOD) detection problem: by modeling next-token prediction as a classification task, well-studied OOD techniques can be applied to language models once their structural differences are accounted for. Building on a geometric perspective, the authors design a novel OOD criterion that yields a training-free, single-sample detector, achieving strong accuracy in identifying hallucinations on reasoning tasks where existing methods are less effective. In doing so, the work points to a scalable new path toward improving the safety of large language models.
📝 Abstract
Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.
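To make the core idea concrete, here is a minimal sketch of what "treating next-token prediction as classification and scoring it with an OOD criterion" can look like. The paper's specific geometric criterion is not described above, so the sketch substitutes the classic energy score over the next-token logits, a standard training-free, single-sample OOD signal; the model name, the Hugging Face transformers usage, and the max-over-tokens aggregation are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: the paper's geometric OOD criterion is not specified
# here, so we use the energy score over next-token logits as a stand-in for a
# training-free, single-sample OOD score. Model choice and aggregation are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM exposes logits the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def token_ood_scores(text: str) -> torch.Tensor:
    """Treat each next-token prediction as a classification over the vocabulary
    and score it with the energy -logsumexp(logits); higher values suggest the
    prediction lies farther from the model's training distribution."""
    inputs = tokenizer(text, return_tensors="pt")
    logits = model(**inputs).logits[0]          # (seq_len, vocab_size)
    energy = -torch.logsumexp(logits, dim=-1)   # one OOD score per position
    return energy

scores = token_ood_scores("The square root of 144 is 13.")
# Assumption: a response is flagged when its worst per-token score exceeds a
# threshold calibrated on generations known to be faithful.
print(scores.max().item())
```

In practice, the per-token scores would be aggregated into a response-level score and compared against a calibrated threshold; the choice of aggregation (max, mean, or last-token) and of the underlying criterion are exactly the design decisions the paper's geometric approach addresses.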