🤖 AI Summary
Reliable evaluation of language model hallucinations remains challenged by poor metric robustness, weak generalizability, and low agreement with human judgments. This work conducts the first large-scale empirical study to systematically assess 12 mainstream hallucination metrics across 4 datasets, 37 models, and 5 decoding strategies. Results reveal pervasive limitations: narrow evaluation scope, unstable gains under parameter scaling, and low inter-annotator agreement with human labels (mean Krippendorff’s α = 0.41). GPT-4 achieves the highest consistency as an evaluator (α = 0.72). Moreover, pattern-optimizing decoding—specifically Top-k and Nucleus sampling—reduces hallucination rates by 18.3% on average in knowledge-augmented settings. The study introduces a multidimensional benchmarking framework and advocates the LLM-as-a-judge paradigm, providing both empirical foundations and methodological guidance for building trustworthy hallucination evaluation systems.
📝 Abstract
Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.