🤖 AI Summary
Existing large language models (LLMs) exhibit limited capability in fine-grained error identification and pedagogically grounded feedback generation for K–12 English writing. Method: We introduce FEANEL, the first fine-grained English writing error analysis benchmark for foundational education, comprising 1,000 student essays. Grounded in a linguistics-informed, part-of-speech-aware error taxonomy co-designed with language education experts, we propose a multidimensional annotation framework covering error type, severity level, and interpretable, instructionally appropriate feedback. High-quality annotations were produced through expert human labeling under rigorous, multi-tiered annotation guidelines. Contribution/Results: We systematically evaluate leading LLMs across three core pedagogical dimensions: error localization, severity classification, and feedback generation. Results reveal substantial deficiencies, particularly in precise error localization and pedagogically sound feedback, underscoring the urgent need for education-specific model adaptation and optimization.
📝 Abstract
Large Language Models (LLMs) have transformed artificial intelligence, offering profound opportunities for educational applications. However, their ability to provide fine-grained educational feedback on K–12 English writing remains underexplored. In this paper, we challenge the error analysis and pedagogical skills of LLMs by introducing the task of Fine-grained Error Analysis for English Learners and presenting the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students, together with a well-developed English writing error taxonomy. Language education experts annotated each error, categorizing it by type, severity, and explanatory feedback using a part-of-speech-based taxonomy they co-developed. We evaluate state-of-the-art LLMs on the FEANEL Benchmark to probe their error analysis and pedagogical abilities. Experimental results reveal significant gaps in current LLMs' ability to perform fine-grained error analysis, highlighting the need for specialized methods tailored to educational applications.
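To make the annotation and evaluation setup concrete, the multidimensional record described above (error location, type, severity, feedback) can be sketched as a simple data structure. The field names, category labels, and exact-match scoring below are illustrative assumptions for this sketch, not the benchmark's actual schema or metric:

```python
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    # Hypothetical fields; the real FEANEL schema may differ.
    span: tuple[int, int]   # character offsets of the error in the essay
    error_type: str         # e.g. a part-of-speech-based category like "verb tense"
    severity: str           # e.g. "minor" or "major" (assumed labels)
    feedback: str           # instructional explanation for the student

def localization_match(pred: ErrorAnnotation, gold: ErrorAnnotation) -> bool:
    """Exact-span agreement: one simple way to score error localization."""
    return pred.span == gold.span

gold = ErrorAnnotation((10, 14), "verb tense", "major", "Use the past tense here.")
pred = ErrorAnnotation((10, 14), "verb form", "minor", "Check the verb.")
print(localization_match(pred, gold))  # True: the spans coincide
```

Under a representation like this, the three evaluation dimensions map naturally onto comparing `span` (localization), `severity` (classification), and `feedback` (generation quality) between model output and expert annotation.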