AI Summary
Existing interpretability research in grammatical error correction (GEC) overlooks joint modeling of correction and explanation, and lacks a comprehensive Chinese evaluation benchmark. Method: We propose explanatory GEC (EXGEC), a novel task emphasizing co-modeling of correction and explanation, and introduce EXCGEC, the first Chinese edit-level interpretable GEC benchmark, comprising 8,216 samples annotated with hybrid edit explanations (operation type, position, and rationale). We design a multi-task learning framework supporting both pre- and post-hoc explanation generation, adopt METEOR and ROUGE for free-text explanation evaluation, and conduct human evaluation for validation. Results: Experiments reveal strong alignment between the automatic metrics and human judgment; however, the multi-task model underperforms the pipeline approach, highlighting key challenges in joint modeling. This work provides three foundational contributions: a formal task definition for EXGEC, the EXCGEC benchmark resource, and a standardized evaluation paradigm for interpretable GEC.
Abstract
Existing studies explore the explainability of Grammatical Error Correction (GEC) only in limited scenarios: they ignore the interaction between corrections and explanations and have not established a corresponding comprehensive benchmark. To bridge this gap, this paper first introduces the task of EXplainable GEC (EXGEC), which focuses on the integral roles of the correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring a design of hybrid edit-wise explanations. We then benchmark several series of LLMs in multi-task learning settings, including post-explaining and pre-explaining. To promote the development of the task, we also build a comprehensive evaluation suite by leveraging existing automatic metrics and conducting human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. Our experiments reveal the effectiveness of evaluating free-text explanations using traditional metrics such as METEOR and ROUGE, as well as the inferior performance of multi-task models compared to the pipeline solution, indicating the difficulty of achieving positive transfer when learning both tasks jointly.
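To make the free-text explanation metrics concrete, below is a minimal pure-Python sketch of ROUGE-L (longest-common-subsequence F1), one of the reference-overlap metrics the abstract mentions. This is an illustrative re-implementation, not the paper's evaluation code; in practice standard packages (e.g. `rouge-score`, NLTK's METEOR) would be used, and Chinese text would typically be tokenized at the character or word level rather than by the whitespace split assumed here.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b,
    computed with standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1 between a reference explanation and a generated one.
    Tokenization by whitespace is a simplifying assumption."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: five of six tokens survive in order, so P = R = F1 = 5/6
score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

Because ROUGE-L rewards in-order token overlap rather than exact matches, it tolerates paraphrased explanations to some degree, which is part of why the paper's human evaluation is needed to confirm that such surface metrics track explanation quality.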