🤖 AI Summary
Current grammatical error correction (GEC) systems suffer from evaluation bias and limited generalization due to insufficient reference diversity. To address this, we propose JELV, an edit-level validity discrimination framework that automatically assesses correction quality along three dimensions: grammaticality, faithfulness, and fluency. We introduce PEVData, the first human-annotated edit-level dataset, and design a composite metric that decouples false-positive detection from fluency scoring. Furthermore, we propose an automated reference augmentation method leveraging a multi-turn LLM-as-Judges pipeline and a distilled DeBERTa classifier. JELV achieves 90% agreement with human judgments, and the composite metric reaches state-of-the-art correlation with human evaluation. Applying JELV to augment reference diversity on benchmarks such as BEA19 and retraining mainstream GEC models on the expanded data yields measurable gains in correction performance.
📝 Abstract
Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework that validates correction edits along three dimensions: grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as a benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric by integrating false-positive decoupling with fluency scoring, achieving state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding BEA19's single-reference dataset of 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV thus provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.
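The abstract does not spell out the composite metric's formula, but the idea of decoupling false positives from fluency scoring can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: `composite_score`, `fluency_fn`, `validity_fn`, and the `fp_penalty` weight are all hypothetical names and values chosen for clarity.

```python
# Hypothetical sketch of a JELV-style composite metric: each proposed edit is
# first judged for validity; valid edits contribute a fluency score, while
# invalid edits (false positives) are scored on a decoupled penalty track.

def composite_score(edits, fluency_fn, validity_fn, fp_penalty=0.5):
    """Score one correction hypothesis at the edit level.

    edits:       list of proposed (original, corrected) edit pairs
    fluency_fn:  maps a valid edit to a fluency score in [0, 1]
    validity_fn: JELV-style judge returning True for valid edits
    fp_penalty:  per-edit penalty for invalid edits (assumed weight)
    """
    if not edits:
        return 1.0  # no edits proposed: treat as neutral
    total = 0.0
    for edit in edits:
        if validity_fn(edit):          # valid edit: reward its fluency
            total += fluency_fn(edit)
        else:                          # false positive: decoupled penalty
            total -= fp_penalty
    return total / len(edits)
```

For example, with one valid edit scored 1.0 for fluency and one invalid edit, the hypothesis averages to (1.0 - 0.5) / 2 = 0.25, so false positives drag the score down without being conflated with fluency quality.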