Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the longstanding lack of a systematic and reliable evaluation framework in natural language generation (NLG), where metric selection is often driven by task-specific convention and the alignment between LLM-as-a-judge (LaaJ) methods and human evaluation remains insufficiently validated. Using automated information extraction, the authors construct a large-scale dataset of evaluation practices from 14,171 NLG papers published over the past six years at four major conferences. Their analysis quantitatively reveals three critical issues: substantial divergence in evaluation methodology across tasks (e.g., over 40% of dialogue generation studies in 2025 adopt LaaJ, while machine translation still relies heavily on n-gram metrics); limited discriminative power of commonly used general-purpose metrics; and only moderate to low correlation between LaaJ and human judgments, with fewer than 8% of studies explicitly validating this alignment. Based on these findings, the paper offers practical recommendations to improve the rigor of NLG evaluation.
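
The summary above hinges on one computational step: detecting which evaluation methods each of the 14,171 papers reports. The paper's actual extraction scheme is not reproduced here; the snippet below is only a minimal keyword-spotting sketch, in which the labels, regex patterns, and the assumed `papers` input format are illustrative choices rather than anything taken from the paper.

```python
import re
from collections import Counter

# Illustrative labels and patterns only; not the paper's extraction scheme.
METHOD_PATTERNS = {
    "BLEU": r"\bBLEU\b",
    "ROUGE": r"\bROUGE(?:-[12LN])?\b",
    "BERTScore": r"\bBERTScore\b",
    "LaaJ": r"\bLLM[- ]as[- ]a[- ]judge\b",
    "human evaluation": r"\bhuman evaluation\b",
}

def extract_methods(paper_text: str) -> set:
    """Return the evaluation-method labels mentioned in one paper's text."""
    return {
        label
        for label, pattern in METHOD_PATTERNS.items()
        if re.search(pattern, paper_text, flags=re.IGNORECASE)
    }

def method_counts_by_year(papers):
    """Tally (year, method) pairs; `papers` is assumed to be a list of
    dicts shaped like {"year": 2025, "task": "dialogue", "text": "..."}."""
    counts = Counter()
    for paper in papers:
        for label in extract_methods(paper["text"]):
            counts[(paper["year"], label)] += 1
    return counts
```

A real survey pipeline, including the automatic information extraction this paper uses, needs far more than keyword spotting, but aggregated (year, method) counts of this shape are the kind of data behind adoption trends such as the 40% LaaJ figure reported for dialogue generation.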

📝 Abstract
Despite advances in Natural Language Generation (NLG), evaluation remains challenging. Although various new metrics and LLM-as-a-judge (LaaJ) methods have been proposed, human judgment persists as the gold standard. To systematically review how NLG evaluation has evolved, we employ an automatic information extraction scheme to gather key information from NLG papers, focusing on different evaluation methods (metrics, LaaJ, and human evaluation). With extracted metadata from 14,171 papers across four major conferences (ACL, EMNLP, NAACL, and INLG) over the past six years, we reveal several critical findings: (1) Task Divergence: While Dialogue Generation demonstrates a rapid shift toward LaaJ (>40% in 2025), Machine Translation remains locked into n-gram metrics, and Question Answering exhibits a substantial decline in the proportion of studies conducting human evaluation. (2) Metric Inertia: Despite the development of semantic metrics, general-purpose metrics (e.g., BLEU, ROUGE) continue to be widely used across tasks without empirical justification, often lacking the discriminative power to distinguish between specific quality criteria. (3) Human-LaaJ Divergence: Our association analysis challenges the assumption that LLMs act as mere proxies for humans; LaaJ and human evaluations prioritize very different signals, and explicit validation is scarce (<8% of papers comparing the two), with only moderate to low correlation. Based on these observations, we derive practical recommendations to improve the rigor of future NLG evaluation.
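
The abstract reports that fewer than 8% of papers explicitly compare LaaJ with human judgments, even though the check itself is inexpensive. As a minimal sketch, with invented scores and a hypothetical single quality criterion, one can correlate the two sets of ratings over the same outputs:

```python
from scipy.stats import kendalltau, spearmanr

# Invented ratings of the same six system outputs on one quality criterion.
human_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]  # e.g., mean annotator Likert scores
laaj_scores = [4.0, 3.5, 4.5, 3.0, 4.5, 2.5]   # e.g., LLM-judge scores on the same scale

rho, rho_p = spearmanr(human_scores, laaj_scores)
tau, tau_p = kendalltau(human_scores, laaj_scores)

print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:.2f} (p = {tau_p:.3f})")
```

Moderate or low values on such a check, reported per criterion, would make visible exactly the human-LaaJ divergence the abstract describes.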
Problem

Research questions and friction points this paper is trying to address.

NLG evaluation
evaluation metrics
LLM-as-a-judge
human evaluation
metric validity
Innovation

Methods, ideas, or system contributions that make the work stand out.

automatic information extraction
LLM-as-a-judge
evaluation trends
human evaluation
NLG metrics
Jing Yang
Post-doc Researcher at the XplainNLP group, Quality and Usability lab at TU Berlin and BIFOLD
Natural Language Processing · XAI · Fact-checking · Misinformation
Nils Feldhus
TU Berlin, BIFOLD, DFKI (Guest)
Natural Language Processing · Interpretability · Explainable AI
Salar Mohtaj
German Research Center for Artificial Intelligence
LLM Evaluation · Natural Language Processing · Fake News Detection · Hate Speech Detection
Leonhard Hennig
Senior Researcher, Deutsches Forschungszentrum für Künstliche Intelligenz
Natural Language Processing · Information Extraction · Machine Learning
Qianli Wang
DFKI & TU Berlin
Explainability · Natural Language Processing
Eleni Metheniti
ANITI
Sherzod Hakimov
University of Potsdam
Natural Language Processing · Semantic Web · Information Extraction · Question Answering · Multimodal Representation Learning
Charlott Jakob
Technische Universität Berlin
Veronika Solopova
Technische Universität Berlin
Computational linguistics · Ethics of AI
Konrad Rieck
Technische Universität Berlin
Computer Security · Machine Learning
David Schlangen
Professor, "Foundations of Computational Linguistics", University of Potsdam
Computational Linguistics · Artificial Intelligence · Conversational Agents · Dialogue Systems
Sebastian Möller
Professor for Quality and Usability, TU Berlin and Scientific Director, DFKI
Quality of Experience · User Experience · Speech · Dialog · Natural Language Processing
Vera Schmitt
Head of XplaiNLP Research Group at TU Berlin
NLP/LLMs · XAI · HCI · Disinformation · Usable Privacy