Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the reliability and validity of five large language models (Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B) for automatically scoring Italian-language student essays written in a university psychology course. Addressing the challenge of domain-specific, interpretive assessment, the authors apply a four-criterion, discipline-grounded rubric (Pertinence, Coherence, Originality, Feasibility), run three prompt replications per model, and use Quadratic Weighted Kappa and Kendall's W to quantify agreement with expert human scoring, within-model stability, and inter-model concordance. Results reveal low agreement between models and human raters (κ < 0.4) and weak within-model consistency; models converge moderately only on Coherence and Originality, while concordance collapses on Pertinence and Feasibility. The work empirically documents systematic LLM scoring divergences in authentic, context-dependent essay evaluation, attributed to limited contextual sensitivity and domain knowledge, and establishes a reproducible methodological benchmark and critical failure boundaries for educational AI assessment.

📝 Abstract
This study investigates the reliability and validity of five advanced Large Language Models (LLMs), namely Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, for automated essay scoring in a real-world higher education context. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement was consistently low and non-significant (Quadratic Weighted Kappa), and within-model reliability across replications was similarly weak (median Kendall's W < 0.30). Systematic scoring divergences emerged, including a tendency to inflate Coherence and inconsistent handling of context-dependent dimensions. Inter-model agreement analysis revealed moderate convergence for Coherence and Originality, but negligible concordance for Pertinence and Feasibility. Although limited in scope, these findings suggest that current LLMs may struggle to replicate human judgment in tasks requiring disciplinary insight and contextual sensitivity. Human oversight remains critical when evaluating open-ended academic work, particularly in interpretive domains.
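
The agreement statistics named in the abstract can be reproduced with standard tooling. Below is a minimal Python sketch, using hypothetical 1-5 rubric scores rather than the study's data, that computes Quadratic Weighted Kappa for human-LLM agreement on one rubric dimension and Kendall's W for a single model's stability across three prompt replications (tie correction omitted).

```python
# Illustrative sketch (not the paper's code): human-LLM agreement via
# Quadratic Weighted Kappa and intra-model stability via Kendall's W.
# All scores below are hypothetical placeholders on a 1-5 rubric scale.
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import cohen_kappa_score

human = np.array([4, 3, 5, 2, 4, 3, 1, 5])  # hypothetical human scores
llm = np.array([5, 4, 5, 3, 3, 4, 2, 5])    # hypothetical LLM scores (one replication)

# Human-LLM agreement on one rubric dimension
qwk = cohen_kappa_score(human, llm, weights="quadratic")

def kendalls_w(scores: np.ndarray) -> float:
    """Kendall's coefficient of concordance for an (m raters x n items) matrix."""
    m, n = scores.shape
    ranks = np.apply_along_axis(rankdata, 1, scores)  # rank items within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Intra-model stability: one model's scores across three prompt replications
replications = np.array([
    [4, 3, 5, 2, 4, 3, 1, 5],
    [3, 3, 4, 2, 5, 3, 2, 5],
    [4, 2, 5, 3, 4, 2, 1, 4],
])
w = kendalls_w(replications)
print(f"QWK (human vs. LLM): {qwk:.2f}; Kendall's W across replications: {w:.2f}")
```

In the study's design these statistics would be computed separately for each model and each rubric criterion, then summarized (e.g., as the median Kendall's W reported above).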
Problem

Research questions and friction points this paper is trying to address.

Evaluating reliability of LLMs for automated essay scoring
Assessing human-LLM agreement in academic essay evaluation
Identifying systematic scoring divergences in LLM assessments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used five advanced LLMs for essay scoring
Assessed reliability via prompt replications (see the sketch after this list)
Analyzed human-LLM and inter-model agreement
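
As referenced above, the following is a hypothetical sketch of the replication design: each of the five models scores every essay three times against the four rubric criteria. `score_essay` is a placeholder for whatever API client the study actually used, not a documented interface.

```python
# Hypothetical sketch of the scoring-collection loop; wire score_essay to a real LLM API.
from typing import Dict, List

MODELS = ["Claude 3.5", "DeepSeek v2", "Gemini 2.5", "GPT-4", "Mistral 24B"]
CRITERIA = ["Pertinence", "Coherence", "Originality", "Feasibility"]
N_REPLICATIONS = 3

def score_essay(model: str, essay: str, criteria: List[str]) -> Dict[str, int]:
    """Placeholder: prompt `model` to rate `essay` on each criterion (1-5)."""
    raise NotImplementedError("connect this to the LLM API of your choice")

def collect_scores(essays: List[str]) -> Dict[str, List[List[Dict[str, int]]]]:
    """Return scores[model][replication][essay_index] -> {criterion: score}."""
    scores: Dict[str, List[List[Dict[str, int]]]] = {m: [] for m in MODELS}
    for model in MODELS:
        for _ in range(N_REPLICATIONS):  # repeat the same prompt to gauge stability
            scores[model].append([score_essay(model, e, CRITERIA) for e in essays])
    return scores
```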
Andrea Gaggioli
Research Center in Communication Psychology (PSICOM), Università Cattolica del Sacro Cuore
Positive Technology · User Experience · Cyberpsychology
Giuseppe Casaburi
Independent Researcher, Salerno, Italy
Leonardo Ercolani
Department of Advanced Computing Sciences, Maastricht University, Maastricht, Netherlands
Francesco Collova'
Independent Researcher, Bacoli (Napoli), Italy
Pietro Torre
Direzione Centrale IT, Istat, Roma, Italy
Fabrizio Davide
Direzione Centrale IT, Istat, Roma, Italy