Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection

📅 2025-04-25

🤖 AI Summary
Reliable evaluation of language model hallucinations is hampered by metrics that lack robustness, generalize poorly, and agree weakly with human judgments. This work conducts a large-scale empirical study of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. The results reveal pervasive limitations: metrics take a narrow view of the problem, show inconsistent gains with parameter scaling, and agree poorly with human labels (mean Krippendorff’s α = 0.41). LLM-based evaluation with GPT-4 achieves the highest agreement as an evaluator (α = 0.72). Mode-seeking decoding methods also appear to reduce hallucinations, by 18.3% on average in knowledge-grounded settings. The study introduces a multidimensional benchmarking framework and advocates the LLM-as-a-judge paradigm, providing empirical foundations and methodological guidance for building trustworthy hallucination evaluation systems.
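
As an illustration of the agreement statistic quoted above, here is a minimal sketch of Krippendorff's α for nominal labels with two raters: one automatic metric and one human annotator. The function name `krippendorff_alpha_nominal` and the toy labels are assumptions made for this example; they are not taken from the paper, and the paper's own computation may differ (for example in how it handles more annotators or non-binary labels).

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of per-item label lists, e.g. [[metric, human], ...].
    None marks a missing label; items with fewer than two labels are skipped.
    """
    # Coincidence matrix: each ordered pair of labels from different raters
    # on the same item contributes 1 / (m - 1), where m is the number of
    # labels that item received.
    coincidences = Counter()
    for labels in units:
        labels = [v for v in labels if v is not None]
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label value and overall number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # only one label value present: alpha is undefined
    return 1.0 - (n - 1) * observed / expected


# Toy data (hypothetical): 1 = "hallucinated", 0 = "faithful".
metric_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
alpha = krippendorff_alpha_nominal(list(zip(metric_labels, human_labels)))
print(round(alpha, 3))  # 0.533
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance, which is why a mean α of 0.41 reads as weak alignment with human labels.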

📝 Abstract
Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.

Problem

Research questions and friction points this paper is trying to address.

Assessing robustness of hallucination detection metrics across diverse models
Evaluating alignment between metrics and human judgments on hallucinations
Identifying effective decoding methods to reduce language model hallucinations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale evaluation of hallucination detection metrics
LLM-based evaluation with GPT-4 performs best
Mode-seeking decoding methods reduce hallucinations
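
To make the decoding contrast concrete, the sketch below compares greedy decoding, a standard example of a mode-seeking method, with top-k and nucleus (top-p) sampling on a single toy next-token distribution. The distribution, token names, and helper functions are hypothetical, and the abstract does not say which five decoding methods the paper uses, so this is only an illustration of the general idea.

```python
import random


def greedy(probs):
    """Mode-seeking choice: always return the most probable token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p."""
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution for "The capital of France is ...".
next_token = {"Paris": 0.55, "Lyon": 0.20, "Berlin": 0.15, "Rome": 0.10}

rng = random.Random(0)
greedy_choice = greedy(next_token)                        # always "Paris"
top_k_choice = sample(top_k_filter(next_token, 2), rng)   # "Paris" or "Lyon"
nucleus_choice = sample(nucleus_filter(next_token, 0.7), rng)
print(greedy_choice, top_k_choice, nucleus_choice)
```

Greedy decoding always commits to the highest-probability token, while the sampling variants can still pick a lower-probability and possibly wrong continuation. That difference is one plausible intuition for why mode-seeking methods would hallucinate less when the correct answer is well supported by the provided knowledge.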