Using Large Language Models for Automated Grading of Student Writing about Science

📅 2024-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual grading of large-scale scientific writing in MOOCs is labor-intensive and costly. Method: This study, the first to evaluate GPT-4's automated scoring in real-world astronomy-themed MOOCs (courses in astronomy, astrobiology, and the history and philosophy of astronomy; N = 120 adult learners), introduces a few-shot prompting framework that integrates human reference answers, structured rubrics, and holistic score feedback to generate interpretable, criterion-based scoring guides. Contribution/Results: LLM-generated scores achieve high inter-rater reliability with instructor scores at both the individual and cohort levels (ICC ≥ 0.85), significantly outperforming peer assessment. The approach demonstrates strong internal consistency, scalability, and pedagogical utility, establishing a generalizable, AI-augmented paradigm for assessing the scientific writing of non-science majors.
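The inter-rater reliability cited above is an intraclass correlation coefficient. As an illustrative sketch (not the authors' exact analysis code), a two-way ICC(2,1) for an n-answers × k-raters score table can be computed directly from the mean squares:

```python
def icc2_1(scores):
    """Two-way random-effects intraclass correlation, ICC(2,1),
    for an n-subjects x k-raters table of scores."""
    n = len(scores)          # subjects (student answers)
    k = len(scores[0])       # raters (e.g. instructor and LLM)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between raters
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Instructor vs. LLM scores for four hypothetical answers:
print(round(icc2_1([[4, 5], [3, 3], [5, 5], [2, 2]]), 3))  # prints 0.936
```

Perfect rater agreement yields an ICC of 1.0; the paper's reported ICC ≥ 0.85 indicates near-instructor-level consistency.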

📝 Abstract
Assessing writing in large classes for formal or informal learners presents a significant challenge. Consequently, most large classes, particularly in science, rely on objective assessment tools such as multiple-choice quizzes, which have a single correct answer. The rapid development of AI has introduced the possibility of using large language models (LLMs) to evaluate student writing. An experiment was conducted using GPT-4 to determine if machine learning methods based on LLMs can match or exceed the reliability of instructor grading in evaluating short writing assignments on topics in astronomy. The audience consisted of adult learners in three massive open online courses (MOOCs) offered through Coursera. One course was on astronomy, the second was on astrobiology, and the third was on the history and philosophy of astronomy. The results should also be applicable to non-science majors in university settings, where the content and modes of evaluation are similar. The data comprised answers from 120 students to 12 questions across the three courses. GPT-4 was provided with total grades, model answers, and rubrics from an instructor for all three courses. In addition to evaluating how reliably the LLM reproduced instructor grades, the LLM was also tasked with generating its own rubrics. Overall, the LLM was more reliable than peer grading, both in aggregate and by individual student, and approximately matched instructor grades for all three online courses. The implication is that LLMs may soon be used for automated, reliable, and scalable grading of student science writing.
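The grading setup described in the abstract supplies GPT-4 with total grades, model answers, and rubrics. A minimal prompt-assembly sketch of such a few-shot setup follows; the function name, field labels, and example content are illustrative assumptions, not the paper's actual prompt:

```python
def build_grading_prompt(question, rubric, model_answer, examples, student_answer):
    """Assemble a few-shot grading prompt: rubric + instructor model answer
    + a few (answer, grade) exemplars, then the answer to be scored.
    Illustrative only; the paper's exact prompt wording is not reproduced here."""
    shots = "\n\n".join(
        f"Student answer: {ans}\nGrade: {grade}" for ans, grade in examples
    )
    return (
        f"You are grading short science-writing answers.\n"
        f"Question: {question}\n"
        f"Rubric:\n{rubric}\n"
        f"Model answer: {model_answer}\n\n"
        f"Graded examples:\n{shots}\n\n"
        f"Student answer: {student_answer}\nGrade:"
    )

prompt = build_grading_prompt(
    "Why do stars appear to twinkle?",
    "- 2 pts: atmospheric turbulence\n- 1 pt: light refraction",
    "Turbulence in Earth's atmosphere refracts starlight along a shifting path.",
    [("Because the air moves the light around.", "2/3"),
     ("Stars blink on and off.", "0/3")],
    "The atmosphere bends the starlight as air pockets shift.",
)
print(prompt)
```

The assembled string would then be sent to the LLM, whose completion after "Grade:" is parsed as the score.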
Problem

Research questions and friction points this paper is trying to address.

Large Class Teaching
Scientific Essay Evaluation
Writing Ability Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4
Automated Essay Scoring
Large-scale Online Courses
Christopher Impey
Department of Astronomy, University of Arizona, Tucson, AZ 85721, United States
Matthew Wenger
Department of Astronomy, University of Arizona, Tucson, AZ 85721, United States
Nikhil Garuda
Department of Astronomy, University of Arizona, Tucson, AZ 85721, United States
Shahriar Golchin
University of Arizona
Sarah Stamer
Department of Astronomy, University of Arizona, Tucson, AZ 85721, United States