🤖 AI Summary
Hallucination attribution remains challenging in large language model (LLM) question-answering due to the difficulty of tracing spurious outputs to specific input tokens.
Method: This paper proposes the first explainable hallucination detection framework integrating dual-path uncertainty modeling: semantic propagation (via dynamic attention fusion) and linguistic generation (via probabilistic sampling). It introduces a token-level uncertainty scoring mechanism attributable to input tokens, augmented by log-average perplexity reweighting and hierarchical semantic uncertainty estimation.
Contribution/Results: Evaluated across multiple QA benchmarks, the framework achieves a state-of-the-art average AUC of 0.833. It is the first to enable fine-grained visualization of hallucination triggers via token-level uncertainty heatmaps, supporting diagnostic attribution and root-cause analysis of hallucinated generations.
📝 Abstract
Large Language Models (LLMs) have become powerful, but hallucinations remain a critical obstacle to their trustworthy use. While previous works have improved hallucination detection by measuring uncertainty, they lack the ability to explain the provenance of hallucinations, i.e., which parts of the input tend to trigger them. Recent work on prompt attacks indicates that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, owing to its probability-based selection of high-level semantics for sampled generations. Based on these observations, we propose RePPL, which recalibrates uncertainty measurement along both aspects, dispatching explainable uncertainty scores to each token and aggregating them into a total score in a perplexity-style log-average form. Experiments show that our method achieves the best overall detection performance across various QA datasets on advanced models (average AUC of 0.833), and it produces token-level uncertainty scores as explanations for the hallucination. Leveraging these scores, we preliminarily identify the chaotic pattern of hallucination and showcase its promising use.
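The abstract describes dispatching per-token uncertainty scores and aggregating them in a perplexity-style log-average form. A minimal sketch of what such an aggregation might look like follows; the function names and the exact per-token scores are illustrative assumptions, not the paper's implementation:

```python
import math

def perplexity(token_logprobs):
    """Standard perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def log_average_score(token_scores):
    """Hypothetical perplexity-style aggregation of per-token uncertainty
    scores: the geometric mean, i.e. exp of the mean log score. Assumes
    each score is a positive number assigned to one token."""
    return math.exp(sum(math.log(s) for s in token_scores) / len(token_scores))
```

As with ordinary perplexity, the geometric mean keeps the aggregate on the same scale as the per-token scores while letting a few high-uncertainty tokens dominate less than an arithmetic maximum would.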