AI Summary
This work exposes a severe vulnerability of reference-free grammatical error correction (GEC) evaluation metrics under adversarial settings: although current metrics correlate well with human judgments, they are easily misled by deliberately optimized, low-quality corrections, undermining their reliability for automated assessment. To address this, we propose the first systematic framework for adversarial attacks targeting four major classes of reference-free metrics: SOME, Scribendi, IMPARA, and LLM-based metrics. Experimental results demonstrate that the generated adversarial corrections achieve statistically significant score improvements over state-of-the-art GEC systems across multiple metrics, despite severe degradation in both grammaticality and semantic fidelity. Our study empirically reveals a fundamental flaw in the reference-free evaluation paradigm and provides a reproducible benchmark to foster the development of robust, trustworthy GEC evaluation methods.
Abstract
Reference-free evaluation metrics for grammatical error correction (GEC) have achieved high correlation with human judgments. However, these metrics are not designed to evaluate adversarial systems that aim to obtain unjustifiably high scores. The existence of such systems undermines the reliability of automatic evaluation, as it can mislead users in selecting appropriate GEC systems. In this study, we propose adversarial attack strategies for four reference-free metrics: SOME, Scribendi, IMPARA, and LLM-based metrics, and demonstrate that our adversarial systems outperform the current state-of-the-art. These findings highlight the need for more robust evaluation methods.
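The core attack idea, producing degenerate "corrections" that nonetheless raise a reference-free score, can be sketched with a toy example. This is a minimal illustration, not the paper's method: `toy_metric` is a hypothetical stand-in for a learned metric such as SOME or IMPARA, and the greedy token-append search is one simple way an adversarial system could exploit a metric's spurious preferences.

```python
def toy_metric(sentence: str) -> float:
    # Hypothetical stand-in for a reference-free metric: it spuriously
    # rewards common function words and longer outputs. Real learned
    # metrics can exhibit analogous exploitable biases.
    bonus_words = {"the", "is", "a", "to", "of"}
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(t in bonus_words for t in tokens) / len(tokens) + 0.01 * len(tokens)

def adversarial_search(source: str, vocab: list[str], n_steps: int = 10) -> str:
    """Greedy attack: repeatedly append whichever vocabulary token
    most increases the metric score, ignoring grammar and meaning."""
    current = source
    for _ in range(n_steps):
        best, best_score = current, toy_metric(current)
        for tok in vocab:
            candidate = current + " " + tok
            score = toy_metric(candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best == current:  # no single-token edit improves the score
            break
        current = best
    return current

source = "He go to school yesterday"
attacked = adversarial_search(source, vocab=["the", "of", "banana", "run"])
# The attacked output scores higher despite being a worse "correction".
assert toy_metric(attacked) > toy_metric(source)
```

The sketch shows why score improvement alone is not evidence of correction quality: the search degrades grammaticality and meaning while the metric's score rises, which is precisely the failure mode the paper demonstrates against real metrics.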