AI Summary
This work addresses factual hallucinations in large language model (LLM) generation, where existing correction methods intervene on every token and thereby corrupt content that was already correct. The authors propose PC-LDCD, a framework that treats hallucinations as geometric anomalies relative to the factual manifold in the latent space of the residual stream. A probabilistic circuit (PCNET) serves as a tractable density estimator, so hallucinations can be detected from exact likelihoods without sampling, external verifiers, or modifications to the base model. Contrastive decoding is then triggered selectively, only when an anomaly is detected. Evaluated on four benchmarks, PCNET reaches up to 99% AUROC for hallucination detection, and PC-LDCD obtains the highest scores on three TruthfulQA metrics (True+Info, MC2, and MC3) for three of the four models tested, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%.
Abstract
One of the most critical challenges for Large Language Models (LLMs) is their tendency to hallucinate, i.e., to produce factually incorrect responses. Existing approaches show promising results in hallucination correction but share a key limitation: they apply corrections indiscriminately to every token, which also corrupts generations that were originally correct. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residual stream. PCNET detects hallucinations as geometric anomalies on the factual manifold by computing the exact Negative Log-Likelihood, and therefore needs none of the sampling, external verifiers, or weight modifications that existing techniques rely on. To demonstrate its effectiveness, we use PCNET as a dynamic gate that distinguishes hallucinated from factual hidden states at each decoding step. The gate triggers our second main contribution, PC-LDCD (Probabilistic Circuit Latent Density Contrastive Decoding), only when the latent geometry deviates from factual regions, leaving correct generations untouched. Across four LLMs ranging from 1B to 8B parameters and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection on CoQA, SQuAD v2.0, and TriviaQA, with AUROC reaching up to 99%. Moreover, compared with state-of-the-art baselines, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores on TruthfulQA in three out of four models, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%. Our proposed method is publicly available on GitHub.
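The gating idea described above can be illustrated with a minimal sketch. It is not the released implementation: a small mixture of factorized Gaussians (one sum node over product nodes with Gaussian leaves) stands in for PCNET, and the function names, the threshold, the weight alpha, and the contrastive formula (1 + alpha) * logits - alpha * contrast_logits are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log-density of a univariate Gaussian leaf."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def mixture_nll(h, weights, means, variances):
    """Exact negative log-likelihood of hidden state h under a shallow circuit:
    a sum node (weights) over product nodes of independent Gaussian leaves."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Product node: leaf log-densities add up.
        log_prod = sum(gaussian_logpdf(x, m, v) for x, m, v in zip(h, mu, var))
        log_terms.append(math.log(w) + log_prod)
    # Sum node: log-sum-exp for numerical stability.
    top = max(log_terms)
    log_density = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return -log_density


def gated_logits(h, logits, contrast_logits, density, threshold, alpha=0.5):
    """Return (logits, triggered). Logits are left untouched when the NLL of h
    is at or below the threshold; otherwise a generic contrastive adjustment
    is applied (assumed form, not taken from the paper)."""
    nll = mixture_nll(h, *density)
    if nll <= threshold:
        return list(logits), False
    adjusted = [(1.0 + alpha) * a - alpha * b
                for a, b in zip(logits, contrast_logits)]
    return adjusted, True


# Toy 2-D "factual" density with two components (hypothetical parameters).
DENSITY = ([0.5, 0.5], [[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]])
THRESHOLD = 6.0  # hypothetical; in practice it would be tuned on held-out data

in_dist, fired_in = gated_logits([0.1, -0.1], [2.0, 1.0], [0.5, 1.5],
                                 DENSITY, THRESHOLD)
out_dist, fired_out = gated_logits([8.0, -8.0], [2.0, 1.0], [0.5, 1.5],
                                   DENSITY, THRESHOLD)
```

In this toy setup the first hidden state lies close to a mixture component, so its logits pass through unchanged, while the second lies far from both components and triggers the adjustment. Because the density is evaluated in closed form, the gate needs a single pass per decoding step and no sampling.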