Dean of LLM Tutors: Exploring Comprehensive and Automated Evaluation of LLM-generated Educational Feedback via LLM Feedback Evaluators

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address hallucination, unreliability, and ethical risks in large language model (LLM)-generated educational feedback, this paper proposes the first multi-dimensional automated evaluation framework tailored to pedagogical contexts, assessing content accuracy, instructional effectiveness, and factual hallucination. We introduce a novel two-tiered “Dean-level supervising Mentor-level” evaluation paradigm and integrate zero-shot/few-shot prompting with supervised fine-tuning to train feedback evaluators on an open-source synthetic assignment dataset. After fine-tuning GPT-4.1, our evaluator achieves human-expert-level performance (79.8% accuracy, 79.4% F1). We further conduct a systematic quality assessment of 2,000 feedback instances generated by commercial LLMs. This work establishes a reproducible, scalable evaluation infrastructure to support safe, controllable deployment of AI in education.

📝 Abstract
The use of LLM tutors to provide automated educational feedback on students' assignment submissions has received much attention in the AI in Education field. However, the stochastic nature of LLMs and their tendency to hallucinate can undermine both the quality of the learning experience and adherence to ethical standards. To address this concern, we propose a method that uses LLM feedback evaluators (DeanLLMs) to automatically and comprehensively evaluate feedback generated by an LLM tutor for university assignment submissions before it is delivered to students. This allows low-quality feedback to be rejected and enables LLM tutors to improve the feedback they generate based on the evaluation results. We first propose a comprehensive evaluation framework for LLM-generated educational feedback, comprising six dimensions for feedback content, seven for feedback effectiveness, and three hallucination types. Next, we generate a virtual assignment-submission dataset covering 85 university assignments from 43 computer science courses using eight commonly used commercial LLMs. We label and open-source this dataset to support the fine-tuning and evaluation of LLM feedback evaluators. Our findings show that o3-pro performs best at zero-shot labelling of feedback, while o4-mini performs best at few-shot labelling. Moreover, GPT-4.1 reaches human-expert-level performance after fine-tuning (accuracy 79.8%, F1-score 79.4%; human average accuracy 78.3%, F1-score 82.6%). Finally, we use our best-performing model to evaluate 2,000 assignment feedback instances generated by 10 common commercial LLMs (200 per model) to compare the quality of feedback across LLMs. Our LLM feedback evaluator method advances our ability to automatically provide high-quality and reliable educational feedback to students.
Problem

Research questions and friction points this paper is trying to address.

Evaluating quality of LLM-generated educational feedback automatically
Addressing hallucinations and ethical issues in LLM tutor feedback
Comparing performance of different LLMs in generating educational feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM feedback evaluators assess tutor feedback automatically
Comprehensive framework evaluates feedback across multiple dimensions
Fine-tuned GPT-4.1 matches human expert evaluation performance
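The gate-and-revise loop implied by these contributions (an evaluator scores tutor feedback before delivery, rejecting or requesting revision of low-quality output) might be sketched as below. The names `generate`, `evaluate`, the `Evaluation` fields, and the threshold are illustrative placeholders for this summary, not the paper's actual interface:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evaluation:
    content: float        # feedback-content score in [0, 1] (hypothetical scale)
    effectiveness: float  # instructional-effectiveness score in [0, 1]
    hallucinated: bool    # whether any hallucination was detected


def gate_feedback(
    generate: Callable[[str], str],              # LLM tutor: submission -> feedback
    evaluate: Callable[[str, str], Evaluation],  # DeanLLM-style evaluator
    submission: str,
    threshold: float = 0.7,
    max_revisions: int = 2,
) -> Optional[str]:
    """Deliver feedback only if the evaluator approves it; otherwise ask
    the tutor to revise, up to max_revisions times, then reject."""
    feedback = generate(submission)
    for _ in range(max_revisions + 1):
        ev = evaluate(submission, feedback)
        if not ev.hallucinated and min(ev.content, ev.effectiveness) >= threshold:
            return feedback  # approved: safe to deliver to the student
        # Low quality or hallucinated: ask the tutor for a revised attempt.
        feedback = generate(submission + "\n[revise previous feedback]")
    return None  # rejected: no acceptable feedback produced
```

The key design point is that the evaluator sits between generation and delivery, so a rejection never reaches the student; only feedback passing every dimension check is released.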