Hallucination Detection and Evaluation of Large Language Model

📅 2025-12-26

📈 Citations: 0

✨ Influential: 0

career value

210K/year

🤖 AI Summary

Hallucinations in large language models (LLMs) severely undermine their reliability, yet existing evaluation methods (e.g., KnowHulu) incur prohibitive computational overhead. Method: This paper proposes HHEM—a lightweight, LLM-free, standalone classification framework for hallucination assessment. Unlike prior approaches, HHEM does not require LLM self-reflection or generation. Contribution/Results: Its key innovations include (1) the first end-to-end classification paradigm independent of LLM generation processes; (2) a segment-wise retrieval mechanism to enhance fine-grained hallucination detection; and (3) novel insights—derived from CDF-based statistical analysis and non-fictionality verification—revealing an inverse U-shaped relationship between model scale and hallucination stability: 7B–9B models exhibit minimal hallucinations, whereas mid-sized models are most unstable. Experiments show HHEM reduces per-sample detection time from 8 hours to 10 minutes, achieving 82.2% accuracy and 78.9% true positive rate.

Technology Category

Application Category

📝 Abstract

Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.

Problem

Research questions and friction points this paper is trying to address.

Detect hallucinations in LLMs efficiently using lightweight classification

Improve localized hallucination detection in summarization tasks

Analyze hallucination patterns across different model sizes systematically

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight classification-based framework for hallucination detection

Segment-based retrieval to improve localized hallucination identification

Comparative analysis across models using efficiency and accuracy metrics

🔎 Similar Papers

AutoHall: Automated Hallucination Dataset Generation for Large Language Models