LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

📅 2025-11-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates the feasibility and inter-rater agreement of GPT-4o for automated scoring of short-answer quizzes and project reports in an undergraduate computational linguistics course. Method: Conducting the first systematic evaluation of LLM-based grading in a real classroom setting, the authors quantify alignment between model and teaching assistant (TA) scores using Pearson correlation coefficients and exact match rates. Contribution/Results: GPT-4o achieves a maximum Pearson correlation of 0.98 with human graders, and 55% of short-answer responses receive scores identical to the TAs'. Project report evaluations also show high holistic agreement. Crucially, the study characterizes the stability and variability of LLM scoring behavior on open-ended educational tasks, revealing both robust consistency and context-sensitive divergence. To support reproducibility and methodological advancement in educational automation, the authors publicly release all source code and annotated sample data, establishing an empirically grounded benchmark for AI-assisted assessment.

๐Ÿ“ Abstract
Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes, along with project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
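The two agreement metrics the abstract reports (Pearson correlation and exact score match) can be sketched in a few lines. This is a minimal illustration, not the paper's released code; the score lists are hypothetical, not the study's actual data.

```python
# Sketch of the agreement metrics used to compare LLM and TA grades:
# Pearson correlation and exact-match rate. Score values are hypothetical.
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def exact_match_rate(x, y):
    """Fraction of items on which the two graders assign identical scores."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


llm_scores = [5, 4, 3, 5, 2, 4]  # hypothetical GPT-4o scores per response
ta_scores = [5, 4, 2, 5, 2, 3]   # hypothetical TA scores for the same items

print(f"Pearson r:   {pearson(llm_scores, ta_scores):.3f}")
print(f"Exact match: {exact_match_rate(llm_scores, ta_scores):.3f}")
```

In practice one would run this per quiz (and per rubric criterion for reports); a library routine such as `scipy.stats.pearsonr` gives the same coefficient plus a p-value.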
Problem

Research questions and friction points this paper is trying to address.

Investigating LLM feasibility for grading student quizzes and reports
Comparing GPT-4o evaluation accuracy against human teaching assistants
Assessing scoring variability on technical, open-ended responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using GPT-4o for automated grading tasks
Comparing LLM scores with human evaluations
Achieving strong correlation in quiz assessments
Grace Byun
Emory University, Atlanta, GA, USA
Swati Rajwal
Emory University, Atlanta, GA, USA
Jinho D. Choi
Associate Professor, Emory University
Natural Language Processing · Computational Linguistics · Conversational AI