Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection

📅 2025-04-25

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Reliable evaluation of language model hallucinations remains challenged by poor metric robustness, weak generalizability, and low agreement with human judgments. This work conducts the first large-scale empirical study to systematically assess 12 mainstream hallucination metrics across 4 datasets, 37 models, and 5 decoding strategies. Results reveal pervasive limitations: narrow evaluation scope, unstable gains under parameter scaling, and low inter-annotator agreement with human labels (mean Krippendorff’s α = 0.41). GPT-4 achieves the highest consistency as an evaluator (α = 0.72). Moreover, pattern-optimizing decoding—specifically Top-k and Nucleus sampling—reduces hallucination rates by 18.3% on average in knowledge-augmented settings. The study introduces a multidimensional benchmarking framework and advocates the LLM-as-a-judge paradigm, providing both empirical foundations and methodological guidance for building trustworthy hallucination evaluation systems.

Technology Category

Application Category

📝 Abstract

Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.

Problem

Research questions and friction points this paper is trying to address.

Assessing robustness of hallucination detection metrics across diverse models

Evaluating alignment between metrics and human judgments on hallucinations

Identifying effective decoding methods to reduce language model hallucinations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale evaluation of hallucination detection metrics

LLM-based evaluation with GPT-4 performs best

Mode-seeking decoding methods reduce hallucinations

🔎 Similar Papers

No similar papers found.