🤖 AI Summary
Existing grammatical error correction (GEC) evaluation metrics offer poor interpretability, making it difficult to pinpoint where a system falls short. To address this, we propose CLEME2.0—a reference-based, interpretable metric that, for the first time, decouples GEC evaluation into four semantically explicit and attributable edit categories: hit-corrections, wrong-corrections, under-corrections, and over-corrections. Leveraging rule-guided fine-grained alignment and a statistical assessment framework, CLEME2.0 classifies each system edit against the reference corrections. Evaluated on two human judgment datasets and six reference datasets, it achieves state-of-the-art correlation with human judgments, surpassing existing reference-based and reference-free metrics, and additionally enables system-level diagnostic analysis and targeted model improvement.
📝 Abstract
The paper focuses on the interpretability of Grammatical Error Correction (GEC) evaluation metrics, which has received little attention in previous studies. To bridge the gap, we introduce **CLEME2.0**, a reference-based metric describing four fundamental aspects of GEC systems: hit-correction, wrong-correction, under-correction, and over-correction. Together, these aspects expose critical qualities of GEC systems and locate their drawbacks. Evaluating systems by combining these aspects also yields higher human consistency than other reference-based and reference-less metrics. Extensive experiments on two human judgment datasets and six reference datasets demonstrate the effectiveness and robustness of our method, achieving a new state-of-the-art result. Our code is released at https://github.com/THUKElab/CLEME.
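The four aspects can be pictured as buckets for a system's edits relative to a reference. A minimal sketch, assuming edits are represented as `(start, end, replacement)` token spans and that spans are matched exactly — the function name and matching rule here are illustrative assumptions, not CLEME2.0's actual alignment algorithm:

```python
# Hedged illustration of the four edit categories from the abstract.
# Edits are (start, end, replacement) tuples over source tokens; the
# exact-span matching below is a simplification for clarity.

def classify_edits(hyp_edits, ref_edits):
    """Bucket a system's edits against one reference's edits."""
    ref_by_span = {(s, e): rep for s, e, rep in ref_edits}
    counts = {"hit": 0, "wrong": 0, "over": 0, "under": 0}
    hyp_spans = set()
    for s, e, rep in hyp_edits:
        hyp_spans.add((s, e))
        if (s, e) in ref_by_span:
            if ref_by_span[(s, e)] == rep:
                counts["hit"] += 1    # hit-correction: matches the reference
            else:
                counts["wrong"] += 1  # wrong-correction: right span, wrong fix
        else:
            counts["over"] += 1       # over-correction: edited an untouched span
    # under-correction: reference edits the system never attempted
    counts["under"] = sum(1 for span in ref_by_span if span not in hyp_spans)
    return counts

# Reference fixes spans (1,2) and (4,5); the system fixes (1,2)
# correctly, edits (7,8) needlessly, and misses (4,5).
ref = [(1, 2, "has"), (4, 5, "an")]
hyp = [(1, 2, "has"), (7, 8, "quickly")]
print(classify_edits(hyp, ref))  # {'hit': 1, 'wrong': 0, 'over': 1, 'under': 1}
```

Combining these counts into a score (e.g. weighting over- and under-corrections differently) is what lets the metric both rank systems and diagnose their specific failure modes.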