🤖 AI Summary
This study addresses the shortage of feedback on handwritten mathematics assignments in large undergraduate STEM courses, where instructor workload limits timely, detailed assessment. Using handwritten responses from nearly 800 students in a single-variable calculus course at the University of California, Irvine, the system combines optical character recognition (OCR) with large language models, prompted with structured rubrics, to automatically generate scores and formative feedback. The work introduces a multi-perspective evaluation protocol tailored to handwritten mathematical work, distills generalizable principles for rubric and prompt design, and outlines a standardized benchmark framework for reproducible research on AI-assisted grading. Experiments show strong agreement between AI-generated scores and teaching assistants' grades, and the feedback was rated favorably by both students and independent reviewers, supporting the system's reliability and practical utility in an authentic instructional setting.
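To make the grading pipeline concrete, the sketch below shows one plausible way an OCR transcript, the problem statement, and a structured rubric could be assembled into a single grading prompt for a chat-completion model. This is not the authors' implementation: the model name, rubric schema, and `grade_submission` helper are illustrative assumptions, and any OpenAI-compatible chat API could stand in for the one shown.

```python
import json
from openai import OpenAI  # any OpenAI-compatible chat client would work similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a calculus grader. Apply the rubric exactly as written, "
    "assign partial credit per rubric item, and give brief formative feedback."
)

def grade_submission(ocr_transcript: str, problem: str,
                     rubric: list[dict], model: str = "gpt-4o") -> dict:
    """Score one OCR-transcribed quiz response against a structured rubric.

    `rubric` is a list of {"item": str, "points": float} entries; this schema
    is an illustrative assumption, not the paper's released format.
    """
    rubric_text = "\n".join(f"- ({r['points']} pts) {r['item']}" for r in rubric)
    user_prompt = (
        f"Problem:\n{problem}\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Student work (OCR transcript, may contain recognition errors):\n"
        f"{ocr_transcript}\n\n"
        'Return JSON: {"score": <number>, "feedback": "<2-3 sentences>"}'
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as the API allows
        response_format={"type": "json_object"},  # force parseable JSON output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```

In a batch setting, this function would simply be mapped over every transcribed submission, with the rubric swapped per problem; warning students and graders that the transcript may contain OCR errors is part of the prompt rather than a post-processing step.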
📝 Abstract
Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes.
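The abstract does not name its agreement statistics, but one common way to quantify AI-TA alignment on integer rubric scores is a quadratically weighted Cohen's kappa together with exact- and adjacent-agreement rates, as in the minimal sketch below (the metric choice and function name are assumptions, not the paper's reported protocol).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def score_agreement(ai_scores: list[int], ta_scores: list[int]) -> dict:
    """Illustrative agreement metrics between AI and TA scores on the same submissions."""
    ai, ta = np.asarray(ai_scores), np.asarray(ta_scores)
    return {
        # quadratic weighting penalizes large score discrepancies more heavily
        "quadratic_weighted_kappa": cohen_kappa_score(ai, ta, weights="quadratic"),
        "exact_agreement": float(np.mean(ai == ta)),
        "within_one_point": float(np.mean(np.abs(ai - ta) <= 1)),
    }

# toy example: five submissions scored out of 4 points
print(score_agreement([3, 4, 2, 4, 1], [3, 4, 3, 4, 1]))
```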
Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.
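As a sketch of what a standardized benchmark entry for handwritten-math grading might carry, each example could pair the scanned image, its OCR transcript, the problem and rubric, and the reference TA score, plus an optional human rating of the AI feedback. The field names below are hypothetical assumptions, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class GradingExample:
    """One benchmark item (hypothetical schema, not the paper's released format)."""
    submission_id: str
    image_path: str              # scanned handwritten response
    ocr_transcript: str          # OCR output the grader is conditioned on
    problem_statement: str
    rubric: list[dict]           # e.g. [{"item": "...", "points": 2.0}, ...]
    ta_score: float              # official TA grade used as the primary reference
    max_score: float
    feedback_rating: str | None = None  # e.g. "correct" / "acceptable" / "incorrect"
```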