The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems

πŸ“… 2025-09-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
In NLP evaluation, quality criteria such as β€œfluency” often lack consistent definitions across studies, undermining cross-experiment comparability and impeding scientific progress. To address this, the authors propose QCET (Quality Criteria for Evaluation Taxonomy), an empirically grounded, hierarchical taxonomy of quality criteria. Derived from a descriptive analysis of three large-scale surveys of NLP evaluations, QCET identifies, defines, and organizes quality criteria in a principled, evidence-based manner, providing a standard set of names and definitions that the several hundred quality criterion names actually in use in the field can be mapped to and grounded in. It supports three main uses: establishing the comparability of existing evaluations, guiding the design of new evaluations, and assessing regulatory compliance. QCET thereby offers a shared semantic foundation for NLP quality assessment, enhancing rigor, interoperability, and cumulative knowledge building in the field.

πŸ“ Abstract
Prior work has shown that two NLP evaluation experiments that report results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality, and the comparability implied by the name can be misleading. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw reliable conclusions about system quality on the basis of multiple, independently conducted evaluations. This in turn hampers the ability of the field to progress scientifically as a whole, a pervasive issue in NLP since its beginning (Sparck Jones, 1981). It is hard to see how the issue of unclear comparability can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the several hundred quality criterion names actually in use in the field can be mapped to, and grounded in. Taking a strictly descriptive approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of quality criterion names and definitions from three surveys of evaluations reported in NLP, and structures them into a hierarchy where each parent node captures common aspects of its child nodes. We present QCET and the resources it consists of, and discuss its three main uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulatory compliance.
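The abstract's first use case (establishing comparability of existing evaluations) can be pictured as a lookup problem: two evaluations are comparable only if their locally chosen criterion names map to the same node in the standard taxonomy. The sketch below is purely illustrative; the node names, the mapping table, and the paper identifiers are hypothetical and are not taken from QCET itself.

```python
# Hypothetical sketch of comparability checking via a standard taxonomy.
# All node names and mappings below are invented for illustration; they
# do not reproduce QCET's actual hierarchy or resources.

# Standard taxonomy: child node -> parent node (root has parent None).
TAXONOMY = {
    "quality_of_outputs": None,
    "correctness": "quality_of_outputs",
    "goodness": "quality_of_outputs",
    "fluency": "goodness",
    "grammaticality": "correctness",
}

# Mapping from (paper, criterion name as reported) to a standard node.
# The same surface name can map to different aspects of quality.
PAPER_TO_STANDARD = {
    ("paper_a", "Fluency"): "fluency",
    ("paper_b", "Fluency"): "grammaticality",
}

def comparable(eval_a, eval_b):
    """Two evaluations are comparable (in the strict sense) only if
    their criterion names map to the same standard taxonomy node."""
    return PAPER_TO_STANDARD[eval_a] == PAPER_TO_STANDARD[eval_b]

# Same criterion name, different underlying aspect -> not comparable.
print(comparable(("paper_a", "Fluency"), ("paper_b", "Fluency")))  # False
```

The parent links would additionally let a tool report partial comparability (shared ancestor node) rather than a bare yes/no, which is where the hierarchical structure earns its keep.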
Problem

Research questions and friction points this paper is trying to address:

- Standardizing quality criterion names in NLP evaluations
- Enabling reliable comparison of independently conducted NLP system evaluations
- Addressing misleading comparability in NLP quality assessment
Innovation

Methods, ideas, or system contributions that make the work stand out:

- Developed a standard taxonomy of quality criterion names and definitions
- Derived the hierarchy from a descriptive analysis of three surveys of NLP evaluations
- Enables comparability assessment, evaluation design guidance, and regulatory compliance checking