🤖 AI Summary
Existing classification-based hallucination detection methods (e.g., SAPLMA) struggle to identify non-factual content that appears early or in the middle of a large language model (LLM) response. To address this, this work applies neural differential equations (Neural DEs) to hallucination detection, modeling the continuous evolution of LLM hidden states during token generation so that truthfulness is assessed over the whole sequence. By explicitly modeling trajectory dynamics in the latent space and learning a mapping from the latent space to the classification space, the approach goes beyond static, token- or span-wise classification. Experiments across five benchmark datasets and six widely used LLMs support the method's effectiveness: on the True-False dataset, it improves AUC-ROC by over 14% compared to state-of-the-art techniques. The result is a time-aware approach to hallucination detection.
📝 Abstract
In recent years, large language models (LLMs) have made remarkable advancements, yet hallucination, where models produce inaccurate or non-factual statements, remains a significant challenge for real-world deployment. Although current classification-based methods, such as SAPLMA, are highly efficient at mitigating hallucinations, they struggle when non-factual information arises early or in the middle of the output sequence, which reduces their reliability. To address this issue, we propose Hallucination Detection-Neural Differential Equations (HD-NDEs), a novel method that systematically assesses the truthfulness of statements by capturing the full dynamics of LLMs within their latent space. Our approach applies neural differential equations (Neural DEs) to model the dynamic system in the latent space of LLMs. The resulting sequence in the latent space is then mapped to the classification space for truth assessment. Extensive experiments across five datasets and six widely used LLMs demonstrate the effectiveness of HD-NDEs, which in particular achieves over 14% improvement in AUC-ROC on the True-False dataset compared to state-of-the-art techniques.
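To make the two steps in the abstract concrete (model the latent dynamics with a differential equation, then map the latent sequence to a classification score), here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the architecture, so the vector field `f(h) = tanh(W h + b)`, the fixed-step Euler solver, the rule that blends each observed hidden state into the evolving latent state, the mean-pooling readout, and all function names are assumptions chosen for illustration. The weights are random and untrained, so the score it prints carries no meaning.

```python
import math
import random


def make_vector_field(dim, seed=0):
    """Hypothetical dynamics f(h) = tanh(W h + b) with small random, untrained weights."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    b = [rng.gauss(0.0, 0.1) for _ in range(dim)]

    def f(h):
        return [
            math.tanh(sum(W[i][j] * h[j] for j in range(dim)) + b[i])
            for i in range(dim)
        ]

    return f


def euler_step(f, h, dt):
    """One explicit Euler step of dh/dt = f(h)."""
    return [hi + dt * di for hi, di in zip(h, f(h))]


def truth_score(hidden_states, f, readout_w, readout_b, dt=0.1, substeps=10):
    """Evolve a latent state across token positions, then classify the trajectory.

    hidden_states: one vector per generated token (stand-ins for LLM hidden states).
    Returns (score in (0, 1), latent trajectory with one state per token).
    """
    z = list(hidden_states[0])  # assumption: start from the first token's state
    trajectory = [z]
    for x in hidden_states[1:]:
        for _ in range(substeps):  # continuous evolution between two tokens
            z = euler_step(f, z, dt)
        # Assumption: blend the observed hidden state into the evolved latent state.
        z = [0.5 * (zi + xi) for zi, xi in zip(z, x)]
        trajectory.append(z)

    # Map the latent sequence to the classification space:
    # mean-pool over time, then apply a logistic readout.
    dim = len(z)
    pooled = [sum(s[i] for s in trajectory) / len(trajectory) for i in range(dim)]
    logit = sum(w * p for w, p in zip(readout_w, pooled)) + readout_b
    return 1.0 / (1.0 + math.exp(-logit)), trajectory


# Toy usage with synthetic "hidden states" (6 tokens, 4 dimensions).
rng = random.Random(1)
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
f = make_vector_field(4)
score, traj = truth_score(hidden_states, f, readout_w=[0.1] * 4, readout_b=0.0)
print(round(score, 3), len(traj))
```

Because the readout pools over the whole latent trajectory, a deviation at an early or middle token can still influence the final score, which is the property the abstract contrasts with static classification. A real system would train the vector field and readout on labeled statements and would likely use an adaptive ODE solver.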