🤖 AI Summary
This study addresses the shortage of feedback on handwritten mathematics assignments in large undergraduate STEM courses, where instructor workload limits timely, detailed assessment. Using handwritten responses from nearly 800 students in a single-variable calculus course at the University of California, Irvine, the system combines optical character recognition (OCR) with large language models, prompted with structured rubrics, to automatically generate scores and formative feedback. The work introduces a multi-perspective evaluation protocol tailored to handwritten mathematical work, distills generalizable principles for rubric and prompt design, and outlines a standardized benchmark framework for reproducible research on AI-assisted grading. Experiments show strong agreement between AI-generated scores and teaching assistants' grades, and the feedback was rated favorably by both students and independent reviewers, supporting the system's reliability and practical utility in an authentic instructional setting.
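To make the grading pipeline concrete, the sketch below shows one plausible way an OCR transcript, the problem statement, and a structured rubric could be assembled into a single grading prompt for a chat-completion model. This is not the authors' implementation: the model name, rubric schema, and `grade_submission` helper are illustrative assumptions, and any OpenAI-compatible chat API could stand in for the one shown.

```python
import json
from openai import OpenAI  # any OpenAI-compatible chat client would work similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a calculus grader. Apply the rubric exactly as written, "
    "assign partial credit per rubric item, and give brief formative feedback."
)

def grade_submission(ocr_transcript: str, problem: str,
                     rubric: list[dict], model: str = "gpt-4o") -> dict:
    """Score one OCR-transcribed quiz response against a structured rubric.

    `rubric` is a list of {"item": str, "points": float} entries; this schema
    is an illustrative assumption, not the paper's released format.
    """
    rubric_text = "\n".join(f"- ({r['points']} pts) {r['item']}" for r in rubric)
    user_prompt = (
        f"Problem:\n{problem}\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Student work (OCR transcript, may contain recognition errors):\n"
        f"{ocr_transcript}\n\n"
        'Return JSON: {"score": <number>, "feedback": "<2-3 sentences>"}'
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as the API allows
        response_format={"type": "json_object"},  # force parseable JSON output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```

In a batch setting, this function would simply be mapped over every transcribed submission, with the rubric swapped per problem; warning students and graders that the transcript may contain OCR errors is part of the prompt rather than a post-processing step.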
📝 Abstract
Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes.
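The abstract does not name its agreement statistics, but one common way to quantify AI-TA alignment on integer rubric scores is a quadratically weighted Cohen's kappa together with exact- and adjacent-agreement rates, as in the minimal sketch below (the metric choice and function name are assumptions, not the paper's reported protocol).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def score_agreement(ai_scores: list[int], ta_scores: list[int]) -> dict:
    """Illustrative agreement metrics between AI and TA scores on the same submissions."""
    ai, ta = np.asarray(ai_scores), np.asarray(ta_scores)
    return {
        # quadratic weighting penalizes large score discrepancies more heavily
        "quadratic_weighted_kappa": cohen_kappa_score(ai, ta, weights="quadratic"),
        "exact_agreement": float(np.mean(ai == ta)),
        "within_one_point": float(np.mean(np.abs(ai - ta) <= 1)),
    }

# toy example: five submissions scored out of 4 points
print(score_agreement([3, 4, 2, 4, 1], [3, 4, 3, 4, 1]))
```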
Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.
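As a sketch of what a standardized benchmark entry for handwritten-math grading might carry, each example could pair the scanned image, its OCR transcript, the problem and rubric, and the reference TA score, plus an optional human rating of the AI feedback. The field names below are hypothetical assumptions, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class GradingExample:
    """One benchmark item (hypothetical schema, not the paper's released format)."""
    submission_id: str
    image_path: str              # scanned handwritten response
    ocr_transcript: str          # OCR output the grader is conditioned on
    problem_statement: str
    rubric: list[dict]           # e.g. [{"item": "...", "points": 2.0}, ...]
    ta_score: float              # official TA grade used as the primary reference
    max_score: float
    feedback_rating: str | None = None  # e.g. "correct" / "acceptable" / "incorrect"
```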