Can we trust LLMs as a tutor for our students? Evaluating the Quality of LLM-generated Feedback in Statistics Exams

📅 2025-11-06
📈 Citations: 1
Influential: 0
🤖 AI Summary
Can large language models (LLMs) be trusted as student tutors in statistics education? This study presents the first systematic, in-situ evaluation of GPT-4–generated personalized feedback within a real university-level statistics course. We developed an LLM-integrated online learning platform enabling exercise submission, automated response analysis, and feedback delivery, and introduced a dual-dimensional evaluation framework combining task-level analysis with a structured feedback taxonomy. Empirical analysis of 2,389 real-world feedback instances revealed a ~7% error rate; feedback predominantly addressed correctness verification but rarely fostered conceptual elaboration or self-regulated learning support. The findings illuminate both the auxiliary potential and critical limitations of LLMs in scaling high-quality instruction in higher education. Methodologically, we establish a reusable, empirically grounded assessment paradigm for LLM-generated pedagogical feedback quality—providing actionable evidence for prompt engineering refinement and human-in-the-loop quality assurance design.

📝 Abstract
One of the central challenges for instructors is offering meaningful individual feedback, especially in large courses. Faced with limited time and resources, educators are often forced to rely on generalized feedback, even when more personalized support would be pedagogically valuable. To overcome this limitation, one potential technical solution is to utilize large language models (LLMs). For an exploratory study using a new platform connected with LLMs, we conducted an LLM-corrected mock exam during the "Introduction to Statistics" lecture at the University of Munich (Germany). The online platform allows instructors to upload exercises along with the correct solutions. Students complete these exercises and receive overall feedback on their results, as well as individualized feedback generated by GPT-4 based on the correct answers provided by the lecturers. The resulting dataset comprised task-level information for all participating students, including individual responses and the corresponding LLM-generated feedback. Our systematic analysis revealed that approximately 7% of the 2,389 feedback instances contained errors, ranging from minor technical inaccuracies to conceptually misleading explanations. Further, using a combined feedback framework approach, we found that the feedback predominantly focused on explaining why an answer was correct or incorrect, with fewer instances providing deeper conceptual insights, learning strategies, or self-regulatory advice. These findings highlight both the potential and the limitations of deploying LLMs as scalable feedback tools in higher education, emphasizing the need for careful quality monitoring and prompt design to maximize their pedagogical value.
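The workflow described in the abstract (instructor-provided reference solutions, student submissions, GPT-4-generated feedback) could be implemented roughly as sketched below. This is a minimal sketch assuming the OpenAI Python client and Chat Completions API; the prompt wording, model parameters, and function name are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch: generating personalized feedback for one exercise submission.
# Assumes the OpenAI Python client (pip install openai); the prompt text and
# structure are illustrative and do not reproduce the study's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_feedback(exercise: str, reference_solution: str, student_answer: str) -> str:
    """Ask GPT-4 for individualized feedback grounded in the lecturer's solution."""
    system_msg = (
        "You are a statistics tutor. Compare the student's answer with the "
        "reference solution provided by the lecturer. Explain what is correct, "
        "what is wrong, and why, in a supportive tone."
    )
    user_msg = (
        f"Exercise:\n{exercise}\n\n"
        f"Reference solution:\n{reference_solution}\n\n"
        f"Student answer:\n{student_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        temperature=0.2,  # keep the feedback close to the reference solution
    )
    return response.choices[0].message.content
```

Grounding the prompt in the lecturer's reference solution, as the platform does, constrains the model and is one lever for reducing the kind of erroneous feedback the study quantifies.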
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-generated feedback quality for statistics exams
Assessing error rates and conceptual accuracy of automated feedback
Analyzing feedback scope limitations in higher education applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using GPT-4 to generate personalized student feedback
Creating an online platform for automated exam correction
Systematically analyzing feedback quality through error classification
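To make the error-classification analysis concrete, the sketch below shows how coded feedback instances might be tallied into an overall error rate and distributions over error severity and pedagogical focus. The column names and category labels are assumptions for illustration; the paper's actual coding sheet is not reproduced here.

```python
# Illustrative tally of manually coded feedback instances.
# Column names ("has_error", "error_type", "feedback_focus") are assumed
# for this sketch and do not reflect the study's actual coding scheme.
import pandas as pd

coded = pd.read_csv("coded_feedback.csv")  # hypothetical export of the coded data

n_total = len(coded)                 # e.g. 2,389 feedback instances
n_errors = coded["has_error"].sum()  # instances flagged as erroneous
error_rate = n_errors / n_total      # roughly 7% in the study

print(f"Error rate: {error_rate:.1%} ({n_errors}/{n_total})")

# Severity distribution among erroneous feedback
# (e.g. minor technical inaccuracy vs. conceptually misleading explanation)
print(coded.loc[coded["has_error"], "error_type"].value_counts(normalize=True))

# Pedagogical focus of the feedback (e.g. correctness explanation vs.
# conceptual elaboration vs. self-regulation advice)
print(coded["feedback_focus"].value_counts(normalize=True))
```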
Markus Herklotz
Social Data Science and AI Lab, LMU Munich, Munich, Germany
Niklas Ippisch
Social Data Science and AI Lab, LMU Munich, Munich, Germany
Anna-Carolina Haensch
LMU Munich
Synthetic Data, Multiple Imputation, Survey Methodology