LLM-based NLG Evaluation: Current Status and Challenges

📅 2024-02-02
🏛️ Computational Linguistics
📈 Citations: 29
Influential: 0
🤖 AI Summary
Traditional content-overlap-based NLG evaluation methods (e.g., n-gram metrics) exhibit limited effectiveness in capturing semantic fidelity and human preferences.

Method: This paper systematically surveys LLM-driven automatic evaluation paradigms and introduces the first comprehensive taxonomy of LLM-based NLG assessment methodologies—encompassing scoring, prompt engineering, fine-tuning, and human-AI collaboration. Through empirical analysis across diverse benchmarks, it rigorously evaluates the robustness, interpretability, and cross-domain generalization of existing approaches.

Contribution/Results: The study identifies fundamental limitations—including brittleness under distribution shift, opacity in decision-making, and poor domain adaptability—and pinpoints key open challenges such as reference-free evaluation and domain adaptation. It characterizes performance boundaries and bias sources across mainstream "LLM-as-a-judge" frameworks, including zero-/few-shot prompting, instruction tuning, and contrastive preference modeling. The work establishes a theoretical foundation and practical guidelines for developing next-generation NLG evaluators that are trustworthy, efficient, and interpretable.
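To make the "prompting LLMs for scoring" paradigm concrete, the following is a minimal sketch of direct rubric-based scoring. The prompt template, the `call`-style judge function, and the score format are all illustrative assumptions, not APIs from the paper; any chat-completion backend could stand in for the judge.

```python
import re
from typing import Optional

# Hypothetical rubric prompt for "LLM-as-a-judge" direct scoring.
# The 1-5 faithfulness scale and the "Score:" reply format are
# assumptions chosen for easy parsing, not prescribed by the survey.
PROMPT_TEMPLATE = (
    "You are an evaluator of generated text.\n"
    "Rate the output for faithfulness to the source on a 1-5 scale.\n"
    "Source: {source}\n"
    "Output: {output}\n"
    "Reply with a single line: Score: <1-5>"
)

def build_prompt(source: str, output: str) -> str:
    """Fill the rubric template with the instance to be judged."""
    return PROMPT_TEMPLATE.format(source=source, output=output)

def parse_score(reply: str) -> Optional[int]:
    """Extract the integer score from the judge's reply.

    Returns None when the reply does not follow the requested
    format, so malformed judgments can be retried or discarded.
    """
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

In practice the built prompt would be sent to an LLM and `parse_score` applied to its reply; returning `None` on malformed replies is one simple way to handle the format-violation failures that prompt-based evaluators are known to exhibit.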

📝 Abstract
Evaluating natural language generation (NLG) is a vital but challenging problem in natural language processing. Traditional evaluation metrics, which mainly capture content overlap (e.g., n-gram overlap) between system outputs and references, are far from satisfactory, and large language models (LLMs) such as ChatGPT have demonstrated great potential for NLG evaluation in recent years. Various LLM-based automatic evaluation methods have been proposed, including metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human-LLM collaborative evaluation. In this survey, we first give a taxonomy of LLM-based NLG evaluation methods and discuss the pros and cons of each. Lastly, we discuss several open problems in this area and point out future research directions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating NLG quality beyond traditional content overlap metrics
Exploring LLM potential for diverse NLG evaluation methods
Identifying challenges and future directions in LLM-based evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs for NLG evaluation metrics
Prompting and fine-tuning LLMs for evaluation
Human-LLM collaborative evaluation methods
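Beyond single-output scoring, prompting-based evaluation is often framed as pairwise comparison, which is known to suffer from position bias (the judge favoring whichever candidate appears first). The sketch below illustrates one common mitigation, judging each pair in both orders and accepting only consistent verdicts; the `judge` callable is a hypothetical stand-in for an LLM comparison call.

```python
# Hedged sketch of pairwise "LLM-as-a-judge" with a position-bias check.
# `judge(first, second)` is assumed to return "first" or "second",
# indicating which candidate the LLM prefers in that presentation order.

def debiased_compare(judge, out_a: str, out_b: str) -> str:
    """Compare two NLG outputs in both orders; report 'tie' on disagreement."""
    v1 = judge(out_a, out_b)  # candidate A shown first
    v2 = judge(out_b, out_a)  # candidate B shown first
    a_wins = v1 == "first" and v2 == "second"   # A preferred both times
    b_wins = v1 == "second" and v2 == "first"   # B preferred both times
    if a_wins:
        return "A"
    if b_wins:
        return "B"
    return "tie"  # inconsistent verdicts suggest position bias
```

A judge that always picks the first-shown candidate, no matter which output occupies that slot, yields `"tie"` under this scheme, so purely positional preferences never decide a comparison.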