🤖 AI Summary
Hallucination in large language models (LLMs) necessitates reliable, comparable automated hallucination evaluation (AHE) methods; however, existing AHE approaches are fragmented and lack a unified theoretical foundation. Method: We conduct a systematic literature review of 2018–2024 publications, proposing the first three-dimensional analytical framework grounded in hallucination granularity (fact-level), evaluator design principles, and evaluation dimensions. We further perform bibliometric analysis, cross-model comparative evaluation, and systematic meta-analysis. Contribution/Results: Our work reveals a co-evolutionary pattern between AHE paradigms and generative model capabilities, establishes the first structured taxonomy of AHE methods, and identifies critical evaluation blind spots. Collectively, this study provides both theoretical grounding and practical guidance for designing trustworthy natural language generation (NLG) benchmarks and assessing LLM reliability.
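To make the three-dimensional framework concrete, the sketch below encodes it as a small data model. Only the three axis names come from the summary above; every enum member, the `AHEMethod` class, and the example entry are hypothetical illustrations, not the survey's actual category labels.

```python
# Hypothetical sketch of the survey's three-dimensional framework.
# The three axes (granularity, evaluator design, assessment facet) come
# from the paper; the member values below are illustrative only.

from dataclasses import dataclass
from enum import Enum, auto

class FactGranularity(Enum):
    """How finely hallucinated content is localized (values illustrative)."""
    TOKEN = auto()
    FACT = auto()
    SENTENCE = auto()
    DOCUMENT = auto()

class EvaluatorDesign(Enum):
    """How the evaluator itself is built (values illustrative)."""
    RULE_BASED = auto()
    MODEL_BASED = auto()
    LLM_AS_JUDGE = auto()
    HUMAN = auto()

class AssessmentFacet(Enum):
    """Which aspect of hallucination is measured (values illustrative)."""
    FACTUAL_CONSISTENCY = auto()
    FAITHFULNESS_TO_SOURCE = auto()
    WORLD_KNOWLEDGE = auto()

@dataclass(frozen=True)
class AHEMethod:
    """One automated hallucination evaluation method, indexed by the three axes."""
    name: str
    granularity: FactGranularity
    design: EvaluatorDesign
    facet: AssessmentFacet

# A hypothetical catalog entry showing how a method would be classified.
example = AHEMethod(
    name="hypothetical-fact-checker",
    granularity=FactGranularity.FACT,
    design=EvaluatorDesign.MODEL_BASED,
    facet=AssessmentFacet.FACTUAL_CONSISTENCY,
)
print(example)
```

Classifying every surveyed method along fixed axes like these is what makes otherwise fragmented AHE approaches directly comparable.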
📄 Abstract
Hallucination in Natural Language Generation (NLG) is like the elephant in the room: obvious, yet largely overlooked until recent advances significantly improved the fluency and grammaticality of generated text and brought the problem to the fore. As text generation models have grown more capable, researchers have paid increasing attention to the phenomenon of hallucination. Despite significant progress in recent years, hallucination evaluation methods remain complex, diverse, and poorly organized. We present the first comprehensive survey of how evaluation methods have evolved alongside text generation models, organized along three dimensions: hallucinated fact granularity, evaluator design principles, and assessment facets. This survey aims to help researchers identify current limitations in hallucination evaluation and highlight future research directions.
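As a concrete illustration of evaluation at the hallucinated-fact granularity, here is a deliberately minimal sketch: it approximates atomic facts by sentences and "support" by token overlap with a reference. Real evaluators in the surveyed literature use fact extraction plus NLI models or LLM judges instead; all function names here (`extract_claims`, `is_supported`, `hallucination_rate`) are hypothetical, not from the paper.

```python
# Minimal, assumption-laden sketch of a fact-granularity hallucination score:
# - claims are approximated by sentence splitting (real systems extract
#   atomic facts with an LLM or OpenIE), and
# - "support" is approximated by token overlap with the reference
#   (real evaluators use entailment models or retrieval + LLM judges).

import re

def extract_claims(text: str) -> list[str]:
    """Naively treat each sentence as one atomic claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def is_supported(claim: str, reference: str, threshold: float = 0.6) -> bool:
    """Crude proxy for entailment: fraction of claim tokens found in the reference."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    ref_tokens = set(re.findall(r"\w+", reference.lower()))
    if not claim_tokens:
        return True
    return len(claim_tokens & ref_tokens) / len(claim_tokens) >= threshold

def hallucination_rate(generated: str, reference: str) -> float:
    """Share of claims in the generation that the reference does not support."""
    claims = extract_claims(generated)
    if not claims:
        return 0.0
    unsupported = sum(not is_supported(c, reference) for c in claims)
    return unsupported / len(claims)

if __name__ == "__main__":
    ref = "Marie Curie won the Nobel Prize in Physics in 1903."
    gen = ("Marie Curie won the Nobel Prize in Physics in 1903. "
           "She was born in Berlin.")
    print(f"hallucination rate: {hallucination_rate(gen, ref):.2f}")  # -> 0.50
```

Swapping out either function, for example replacing the overlap heuristic with an NLI model, changes the evaluator design principle while keeping the granularity and assessment facet fixed, which is exactly the kind of variation the three-dimensional view is meant to expose.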