LLM-based NLG Evaluation: Current Status and Challenges

📅 2024-02-02
🏛️ Computational Linguistics
📈 Citations: 29
Influential: 0
🤖 AI Summary
Traditional content-overlap-based NLG evaluation methods (e.g., n-gram metrics) exhibit limited effectiveness in capturing semantic fidelity and human preferences.

Method: This paper systematically surveys LLM-driven automatic evaluation paradigms and introduces the first comprehensive taxonomy of LLM-based NLG assessment methodologies—encompassing scoring, prompt engineering, fine-tuning, and human-AI collaboration. Through empirical analysis across diverse benchmarks, it rigorously evaluates the robustness, interpretability, and cross-domain generalization of existing approaches.

Contribution/Results: The study identifies fundamental limitations—including brittleness under distribution shift, opacity in decision-making, and poor domain adaptability—and pinpoints key open challenges such as reference-free evaluation and domain adaptation. It characterizes performance boundaries and bias sources across mainstream "LLM-as-a-judge" frameworks, including zero-/few-shot prompting, instruction tuning, and contrastive preference modeling. The work establishes a theoretical foundation and practical guidelines for developing next-generation NLG evaluators that are trustworthy, efficient, and interpretable.
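To make the "prompting LLMs for scoring" paradigm concrete, the following is a minimal sketch of direct rubric-based scoring. The prompt template, the `call`-style judge function, and the score format are all illustrative assumptions, not APIs from the paper; any chat-completion backend could stand in for the judge.

```python
import re
from typing import Optional

# Hypothetical rubric prompt for "LLM-as-a-judge" direct scoring.
# The 1-5 faithfulness scale and the "Score:" reply format are
# assumptions chosen for easy parsing, not prescribed by the survey.
PROMPT_TEMPLATE = (
    "You are an evaluator of generated text.\n"
    "Rate the output for faithfulness to the source on a 1-5 scale.\n"
    "Source: {source}\n"
    "Output: {output}\n"
    "Reply with a single line: Score: <1-5>"
)

def build_prompt(source: str, output: str) -> str:
    """Fill the rubric template with the instance to be judged."""
    return PROMPT_TEMPLATE.format(source=source, output=output)

def parse_score(reply: str) -> Optional[int]:
    """Extract the integer score from the judge's reply.

    Returns None when the reply does not follow the requested
    format, so malformed judgments can be retried or discarded.
    """
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

In practice the built prompt would be sent to an LLM and `parse_score` applied to its reply; returning `None` on malformed replies is one simple way to handle the format-violation failures that prompt-based evaluators are known to exhibit.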

📝 Abstract
Evaluating natural language generation (NLG) is a vital but challenging problem in natural language processing. Traditional evaluation metrics, which mainly capture content overlap (e.g., n-gram overlap) between system outputs and references, are far from satisfactory, and large language models (LLMs) such as ChatGPT have demonstrated great potential for NLG evaluation in recent years. Various LLM-based automatic evaluation methods have been proposed, including metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human-LLM collaborative evaluation. In this survey, we first give a taxonomy of LLM-based NLG evaluation methods and discuss the pros and cons of each. Lastly, we discuss several open problems in this area and point out future research directions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating NLG quality beyond traditional content overlap metrics
Exploring LLM potential for diverse NLG evaluation methods
Identifying challenges and future directions in LLM-based evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs for NLG evaluation metrics
Prompting and fine-tuning LLMs for evaluation
Human-LLM collaborative evaluation methods
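Beyond single-output scoring, prompting-based evaluation is often framed as pairwise comparison, which is known to suffer from position bias (the judge favoring whichever candidate appears first). The sketch below illustrates one common mitigation, judging each pair in both orders and accepting only consistent verdicts; the `judge` callable is a hypothetical stand-in for an LLM comparison call.

```python
# Hedged sketch of pairwise "LLM-as-a-judge" with a position-bias check.
# `judge(first, second)` is assumed to return "first" or "second",
# indicating which candidate the LLM prefers in that presentation order.

def debiased_compare(judge, out_a: str, out_b: str) -> str:
    """Compare two NLG outputs in both orders; report 'tie' on disagreement."""
    v1 = judge(out_a, out_b)  # candidate A shown first
    v2 = judge(out_b, out_a)  # candidate B shown first
    a_wins = v1 == "first" and v2 == "second"   # A preferred both times
    b_wins = v1 == "second" and v2 == "first"   # B preferred both times
    if a_wins:
        return "A"
    if b_wins:
        return "B"
    return "tie"  # inconsistent verdicts suggest position bias
```

A judge that always picks the first-shown candidate, no matter which output occupies that slot, yields `"tie"` under this scheme, so purely positional preferences never decide a comparison.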