NLG Evaluation: Past, Present, Future

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper traces how natural language generation (NLG) evaluation has changed since 1990 and where it is likely to go next. It combines a retrospective review with forward-looking analysis across human evaluation, automatic metrics, and newer approaches such as LLM-as-Judge. It describes a shift from a linguistics-oriented field with very little formal experimental evaluation toward a machine learning-driven field in which experimental evaluation is expected and fundamental to research. Looking ahead, the paper expects impact, qualitative, and safety evaluation to become more important as large numbers of people routinely use NLG technology.

📝 Abstract

Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.

Problem

Research questions and friction points this paper is trying to address.

NLG evaluation

impact evaluation

qualitative evaluation

safety evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

NLG evaluation

LLM-as-Judge

qualitative evaluation

safety evaluation