Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge

📅 2024-11-14
📈 Citations: 2
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently generate hallucinated text, yet existing detection methods rely on external knowledge bases, supervised fine-tuning, or large-scale annotated data—and lack fine-grained hallucination categorization. This work introduces *hallucination probing*, a novel task that classifies LLM-generated text into three distinct types: *aligned*, *misaligned*, and *fabricated*—without requiring external knowledge or labeled supervision. Leveraging pronounced differences in internal activation patterns across model layers under key-entity perturbations, the authors propose SHINE: an unsupervised, zero-shot, fine-tuning-free method that combines input perturbation analysis, inter-layer activation modeling, and zero-shot pattern discrimination to enable both detection and fine-grained classification. Evaluated across four LLMs and four benchmark datasets, SHINE consistently outperforms seven state-of-the-art baselines in hallucination detection—achieving new SOTA performance—and is reported to be the first method to accurately distinguish all three hallucination types.
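The core signal the summary describes—divergence in per-layer activations between an original prompt and one with a key entity perturbed—can be sketched as follows. This is a minimal illustration, not the paper's actual SHINE implementation: the placeholder arrays stand in for hidden states that would in practice be extracted from the LLM, and `layerwise_divergence` is a hypothetical helper name.

```python
import numpy as np

def layerwise_divergence(acts_orig, acts_pert):
    """Cosine distance per layer between activations of the original
    prompt and the entity-perturbed prompt (higher = more sensitive)."""
    divs = []
    for a, b in zip(acts_orig, acts_pert):
        a, b = a.ravel(), b.ravel()
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        divs.append(1.0 - cos)
    return np.array(divs)

# Placeholder activations of shape (num_layers, hidden_dim); in practice
# these would be the LLM's hidden states for each prompt variant.
rng = np.random.default_rng(0)
orig = rng.normal(size=(12, 64))
pert = orig + rng.normal(scale=0.3, size=(12, 64))  # perturbation shifts activations

scores = layerwise_divergence(orig, pert)  # one divergence score per layer
```

The resulting per-layer divergence profile is the kind of pattern that, per the summary, differs between aligned, misaligned, and fabricated generations and can therefore be discriminated without labels.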

📝 Abstract
LLM hallucination, where unfaithful text is generated, presents a critical challenge for LLMs' practical applications. Current detection methods often resort to external knowledge, LLM fine-tuning, or supervised training with large hallucination-labeled datasets. Moreover, these approaches do not distinguish between different types of hallucinations, which is crucial for enhancing detection performance. To address such limitations, we introduce hallucination probing, a new task that classifies LLM-generated text into three categories: aligned, misaligned, and fabricated. Driven by our novel discovery that perturbing key entities in prompts affects LLM's generation of these three types of text differently, we propose SHINE, a novel hallucination probing method that does not require external knowledge, supervised training, or LLM fine-tuning. SHINE is effective in hallucination probing across three modern LLMs, and achieves state-of-the-art performance in hallucination detection, outperforming seven competing methods across four datasets and four LLMs, underscoring the importance of probing for accurate detection.
Problem

Research questions and friction points this paper is trying to address.

Detect LLM hallucination without external knowledge or training
Classify LLM-generated text into aligned, misaligned, and fabricated types
Improve hallucination detection performance by probing internal knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perturbing key entities to classify hallucinations
SHINE method avoids external knowledge and training
State-of-the-art performance in hallucination detection