🤖 AI Summary
This work addresses the challenge of hallucination in large language models, which undermines factual reliability in high-stakes applications. Existing detection methods often rely on surface-level output signals and overlook anomalies in internal reasoning dynamics. To bridge this gap, the paper introduces zigzag persistence from topological data analysis into hallucination detection for the first time. By modeling the dynamic evolution of layer-wise attention mechanisms, the authors construct a zigzag filtration to extract topological features of attention structures. This approach reveals fundamental differences between factual and hallucinatory generations at the structural level and enables effective hallucination identification using only partial model depth. Extensive experiments demonstrate that the proposed method outperforms strong baselines across multiple benchmarks, confirming the cross-model generalizability and robustness of the extracted topological signatures.
📝 Abstract
The factual reliability of Large Language Models (LLMs) remains a critical barrier to their adoption in high-stakes domains due to their propensity to hallucinate. Current detection methods often rely on surface-level signals from the model's output, overlooking the failures that occur within the model's internal reasoning process. In this paper, we introduce a new paradigm for hallucination detection that analyzes the dynamic topology of how the model's layer-wise attention evolves. We model the sequence of attention matrices as a zigzag graph filtration and use zigzag persistence, a tool from Topological Data Analysis, to extract a topological signature. Our core hypothesis is that factual and hallucinated generations exhibit distinct topological signatures. We validate our framework, HalluZig, on multiple benchmarks, demonstrating that it outperforms strong baselines. Furthermore, our analysis reveals that these topological signatures generalize across different models and that hallucination detection is possible using only structural signatures from partial network depth.
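To make the idea of a zigzag graph filtration over attention layers more concrete, the following is a minimal sketch and not the HalluZig implementation. It assumes, purely for illustration, that each layer's attention matrix is symmetrized and thresholded into an undirected graph over tokens, and that consecutive layer graphs are linked through their union, giving a sequence G_1 → G_1 ∪ G_2 ← G_2 → … in which every arrow is an inclusion. The function names, the threshold value, and the toy matrices are all hypothetical. The sketch reports only the Betti numbers (connected components and independent cycles) at each step; actual zigzag persistence additionally tracks when each feature is born and dies along the sequence, which requires a dedicated TDA library.

```python
import numpy as np


def attention_to_edges(attn, threshold):
    """Undirected edge set of a symmetrized, thresholded attention matrix.

    Assumption: thresholding is one illustrative way to turn attention
    weights into a graph; the abstract does not specify the construction.
    """
    sym = np.maximum(attn, attn.T)  # treat attention as undirected
    n = sym.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if sym[i, j] >= threshold}


def zigzag_graph_sequence(attn_layers, threshold=0.2):
    """Build G_1 -> G_1 ∪ G_2 <- G_2 -> ... (assumed union-based zigzag)."""
    graphs = [attention_to_edges(a, threshold) for a in attn_layers]
    sequence = [graphs[0]]
    for prev, nxt in zip(graphs, graphs[1:]):
        sequence.append(prev | nxt)  # both layer graphs include into the union
        sequence.append(nxt)
    return sequence


def betti_numbers(n_vertices, edges):
    """Return (beta_0, beta_1) of a graph using union-find.

    beta_0 counts connected components; for a graph,
    beta_1 = |E| - |V| + beta_0 counts independent cycles.
    """
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n_vertices
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    return components, len(edges) - n_vertices + components


# Toy example: two hypothetical 4-token attention layers.
layer_1 = np.zeros((4, 4))
layer_1[1, 0] = layer_1[2, 1] = 0.5
layer_2 = np.zeros((4, 4))
layer_2[2, 1] = layer_2[2, 0] = layer_2[3, 2] = 0.5

layers = [layer_1, layer_2]
profile = [betti_numbers(4, g) for g in zigzag_graph_sequence(layers)]
print(profile)  # [(2, 0), (1, 1), (1, 0)]
```

In this toy case a cycle among tokens 0, 1, and 2 exists only in the union step and disappears again in the second layer's graph. Short-lived features of this kind are what a zigzag barcode would record as intervals, and a vectorized form of such intervals could then serve as the topological signature passed to a classifier.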