Shaking to Reveal: Perturbation-Based Detection of LLM Hallucinations

📅 2025-06-03
🤖 AI Summary
Existing self-assessment methods for hallucination detection in large language models (LLMs) rely on output confidence scores, which become unreliable under output distribution shift. Method: the paper proposes a new self-assessment paradigm based on the perturbation sensitivity of intermediate representations. It replaces output confidence with sample-specific responses of intermediate-layer representations to dynamically generated noise-augmented prompts, employs a lightweight encoder to amplify the perturbation signal, and introduces a contrastive distance metric to discriminate factual from hallucinated answers. Contribution/Results: the approach removes the strong output-distribution-consistency assumption required by conventional self-assessment methods. Evaluated on multiple hallucination detection benchmarks, it significantly outperforms state-of-the-art baselines, with especially large gains in both accuracy and robustness on highly biased LLMs.

📝 Abstract
Hallucination remains a key obstacle to the reliable deployment of large language models (LLMs) in real-world question answering tasks. A widely adopted strategy to detect hallucination, known as self-assessment, relies on the model's own output confidence to estimate the factual accuracy of its answers. However, this strategy assumes that the model's output distribution closely reflects the true data distribution, which may not always hold in practice. As bias accumulates through the model's layers, the final output can diverge from the underlying reasoning process, making output-level confidence an unreliable signal for hallucination detection. In this work, we propose Sample-Specific Prompting (SSP), a new framework that improves self-assessment by analyzing perturbation sensitivity at intermediate representations. These representations, being less influenced by model bias, offer a more faithful view of the model's latent reasoning process. Specifically, SSP dynamically generates noise prompts for each input and employs a lightweight encoder to amplify the changes in representations caused by the perturbation. A contrastive distance metric is then used to quantify these differences and separate truthful from hallucinated responses. By leveraging the dynamic behavior of intermediate representations under perturbation, SSP enables more reliable self-assessment. Extensive experiments demonstrate that SSP significantly outperforms prior methods across a range of hallucination detection benchmarks.
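The core idea in the abstract can be sketched in a few lines: compare an intermediate-layer representation of the clean prompt against that of a noise-augmented variant, pass both through a small encoder, and use the resulting distance as the hallucination signal. The sketch below is a minimal toy illustration under loose assumptions, not the paper's implementation; the `lightweight_encoder` (a single linear-plus-tanh projection), the cosine-based distance, and the synthetic representations are all hypothetical stand-ins for the unspecified architecture and contrastive metric.

```python
import numpy as np

rng = np.random.default_rng(0)

def lightweight_encoder(h, W):
    """Toy stand-in for SSP's lightweight encoder: a single linear
    projection with a tanh nonlinearity (the paper's actual encoder
    architecture is not specified in this summary)."""
    return np.tanh(h @ W)

def perturbation_distance(h_clean, h_noisy, W):
    """Cosine distance between encoded representations of the clean
    prompt and its noise-augmented variant; a larger value indicates
    higher perturbation sensitivity."""
    z1 = lightweight_encoder(h_clean, W)
    z2 = lightweight_encoder(h_noisy, W)
    cos = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    return 1.0 - cos

# Hypothetical intermediate-layer representations (dimension 16).
d = 16
W = rng.normal(size=(d, d)) / np.sqrt(d)
h_clean = rng.normal(size=d)

# A "truthful" sample: the representation barely moves under noise.
h_stable = h_clean + 0.01 * rng.normal(size=d)
# A "hallucinated" sample: the representation shifts strongly.
h_shifted = h_clean + 1.0 * rng.normal(size=d)

d_truthful = perturbation_distance(h_clean, h_stable, W)
d_halluc = perturbation_distance(h_clean, h_shifted, W)
assert d_truthful < d_halluc  # higher sensitivity flags hallucination
```

In the actual framework, the two representations would come from the LLM's intermediate layers for an input and its dynamically generated noise prompt, and the threshold separating truthful from hallucinated responses would be learned rather than fixed here.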
Problem

Research questions and friction points this paper is trying to address.

Detecting hallucinations in LLM question answering tasks
Improving self-assessment via intermediate representation analysis
Reducing model bias impact on hallucination detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perturbation-based detection of LLM hallucinations
Sample-Specific Prompting (SSP) framework
Contrastive distance metric for separating truthful from hallucinated responses
Jinyuan Luo
Australian Artificial Intelligence Institute, University of Technology Sydney
Zhen Fang
Australian Artificial Intelligence Institute, University of Technology Sydney
Yixuan Li
Department of Computer Sciences, University of Wisconsin-Madison
Seongheon Park
University of Wisconsin-Madison
Ling Chen
Australian Artificial Intelligence Institute, University of Technology Sydney