Automated Grading of Handwritten Mathematics Using Vision-Capable LLMs

📅 2026-05-18

🤖 AI Summary

Automated scoring of handwritten multi-step mathematical solutions remains difficult to scale because it requires both image understanding and grading logic. This work uses a vision-capable large language model (LLM) to transcribe handwritten solutions and evaluate them against instructor-defined rubrics within a single model call. The approach is evaluated on student work from two university STEM courses, with AI grading decisions compared against human-assigned ground truth at the rubric-item level. Overall accuracy is high, and most errors (87% for the best model) stem from transcription failures rather than rubric misapplication. The results point to transcription as the primary bottleneck, support a categorization of common error modes, and suggest directions for system design and prompt refinement.

📝 Abstract

Automated grading systems have enabled scalable assessment for many response types, but handwritten mathematics remains a barrier due to the complexity of multi-step solutions. Vision-capable large language models (LLMs) offer new opportunities here, yet their reliability in authentic instructional settings remains poorly understood. We present an empirical evaluation of an LLM-based grader for handwritten mathematical work using instructor-defined rubrics. Extending a prior pipeline for typed responses, we integrate transcription and rubric-based evaluation of photographic submissions within a single LLM call and evaluate it on student work from two university STEM courses. Comparing AI grading decisions against human-assigned ground truth at the rubric-item level, we observe high overall accuracy, with most errors (87% in the best model) attributable to transcription failures rather than rubric misapplication. We categorize common error modes, including image quality issues, hallucinated content, and incorrect handling of equivalent expressions. These findings highlight both the promise and limitations of LLM-based grading for handwritten mathematics, providing guidance for system design, prompt refinement, and deployment in educational settings.
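
The rubric-item-level comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the `ItemDecision` record, its field names, the cause labels, and the toy data are all assumptions made for illustration, and the numbers it produces are unrelated to the paper's results. It shows one plausible way to compute agreement with human ground truth and the share of disagreements attributed to each error cause.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ItemDecision:
    """One rubric item on one submission (hypothetical record layout)."""
    submission_id: str
    rubric_item: str
    human_applied: bool               # ground truth from the human grader
    ai_applied: bool                  # decision produced by the LLM grader
    error_cause: Optional[str] = None  # manual label, set only for disagreements


def item_accuracy(decisions: List[ItemDecision]) -> float:
    """Fraction of rubric-item decisions where AI and human agree."""
    if not decisions:
        raise ValueError("no decisions to score")
    agree = sum(d.human_applied == d.ai_applied for d in decisions)
    return agree / len(decisions)


def error_cause_shares(decisions: List[ItemDecision]) -> Dict[str, float]:
    """Share of disagreements attributed to each labeled cause."""
    errors = [d for d in decisions if d.human_applied != d.ai_applied]
    if not errors:
        return {}
    counts = Counter(d.error_cause or "unlabeled" for d in errors)
    return {cause: n / len(errors) for cause, n in counts.items()}


# Toy data, invented for illustration only (assumed labels and IDs).
toy = [
    ItemDecision("s1", "correct setup", True, True),
    ItemDecision("s1", "sign error", False, False),
    ItemDecision("s1", "final answer", True, False, "transcription"),
    ItemDecision("s2", "correct setup", True, True),
    ItemDecision("s2", "sign error", True, True),
    ItemDecision("s2", "final answer", False, True, "rubric"),
    ItemDecision("s3", "correct setup", False, False),
    ItemDecision("s3", "final answer", True, True),
]

print("item-level accuracy:", item_accuracy(toy))
print("error shares:", error_cause_shares(toy))
```

Scoring each rubric item separately, rather than each submission's total, keeps a single misread symbol from hiding otherwise correct decisions, and the per-cause shares make the split between transcription and rubric errors explicit.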

Problem

Research questions and friction points this paper is trying to address.

automated grading

handwritten mathematics

vision-capable LLMs

multi-step solutions

educational assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-capable LLMs

automated grading

handwritten mathematics