AI Summary
This study addresses the lack of systematic evolutionary analysis in current natural language generation (NLG) evaluation, which hinders its ability to meet future assessment demands along dimensions of impact, qualitative understanding, and safety. The work presents the first comprehensive historical synthesis of NLG evaluation since 1990, integrating retrospective review with forward-looking trend analysis across human evaluation, automatic metrics, and emerging approaches such as LLM-as-Judge. It reveals a paradigmatic shift from linguistically oriented, non-experimental methodologies toward machine learning-driven, experimentally grounded frameworks. Furthermore, the paper prospectively identifies impact, qualitative insight, and safety as pivotal dimensions for next-generation NLG evaluation, offering a theoretical foundation and strategic direction for developing more robust and holistic assessment systems.
Abstract
Natural Language Generation (NLG) evaluation has changed dramatically since 1990. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.