JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current grammatical error correction (GEC) systems suffer from evaluation bias and limited generalization due to insufficient reference diversity. To address this, we propose JELV, an edit-level validity discrimination framework that assesses correction edits along three dimensions: grammaticality, faithfulness, and fluency. We introduce PEVData, a human-annotated pair-wise edit-level validity dataset, and design a composite metric that decouples false-positive detection from fluency scoring. We further propose an automated reference augmentation method that filters LLM-generated candidates with multi-turn LLM-as-Judges and a distilled DeBERTa classifier. JELV achieves 90% agreement with human judgments and substantially enhances reference diversity on benchmarks such as BEA19. Retraining mainstream GEC models on the expanded references yields measurable gains, while the JELV-based metric reaches state-of-the-art correlation with human judgments.

📝 Abstract
Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework that validates correction edits for grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as a benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric that integrates false-positive decoupling and fluency scoring, achieving state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding BEA19's single-reference dataset of 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV thus provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited reference diversity in GEC evaluation
Validates correction edits for grammaticality, faithfulness, fluency
Expands reference datasets to improve model generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated framework validates edits for grammaticality, faithfulness, fluency
Multi-turn LLM pipeline and distilled DeBERTa classifier achieve high accuracy
Expands reference datasets and improves evaluation metrics via edit filtering (see the sketch below)
👥 Authors
Yuhao Zhan
Zhejiang University, Hangzhou, China
Yuqing Zhang
University of Groningen
computational linguistics · speech processing
Jing Yuan
Ludwig-Maximilians-Universität München, Munich, Germany
Qixiang Ma
Zhejiang University, Hangzhou, China
Zhiqi Yang
Zhejiang University, Hangzhou, China
Yu Gu
Zhejiang University, Hangzhou, China
Zemin Liu
Zhejiang University
Graph Learning · Graph Imbalanced Learning
Fei Wu
Zhejiang University, Hangzhou, China