EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

📅 2026-01-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation methods struggle to comprehensively assess how well multimodal large language models understand handwritten student solutions (mixed formulas, diagrams, and text) as encountered in authentic STEM coursework, and the field lacks high-quality, domain-specific benchmarks. To address this gap, this work constructs and releases EDU-CIRCUIT-HW, a dataset of over 1,300 expert-validated, transcribed, and scored real-world handwritten homework submissions. Leveraging this benchmark, we present the first systematic evaluation of model performance on both upstream recognition fidelity and downstream automated scoring. Furthermore, we propose a lightweight human-in-the-loop correction mechanism that significantly enhances scoring robustness by identifying recurrent error patterns. Experiments demonstrate that reviewing only 3.3% of samples suffices to mitigate latent failures in current models' comprehension of complex handwritten reasoning.
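As an illustration of what an upstream recognition-fidelity check could look like, the sketch below compares a model transcription against an expert verbatim transcription using character error rate (CER). The function names and the choice of CER are assumptions for illustration only; the paper's exact fidelity metric may differ.

```python
# Minimal sketch of a recognition-fidelity check: character error rate (CER)
# between an MLLM transcription and the expert verbatim reference.
# The function names and the choice of CER are illustrative assumptions,
# not necessarily the benchmark's exact protocol.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def char_error_rate(hypothesis: str, reference: str) -> float:
    """CER = edit distance / reference length; lower is better."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)

# Example: model transcription vs. expert transcription of the same solution step.
print(char_error_rate("V = I * R, so I = 2A", "V = IR, so I = 2 A"))
```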
📝 Abstract
Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Using the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and rectify recognition errors, with only minimal human intervention (approximately 4% of the total solutions), can significantly enhance the robustness of the deployed AI-enabled grading system on unseen student solutions.
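To make the error-pattern-driven correction idea concrete, here is a minimal sketch of how transcriptions matching recurrent recognition-error patterns could be routed to a small human-review queue before auto-grading. The regex patterns, function names, and flagging logic are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of human-in-the-loop triage: transcriptions that match
# recurrent recognition-error patterns go to human review; the rest are
# graded automatically. Patterns and names below are hypothetical examples.
import re

# Hypothetical recurrent error patterns (e.g., confusable symbols in circuit work).
ERROR_PATTERNS = [
    re.compile(r"\bmQ\b"),        # "mΩ"/"MΩ" misread as "mQ"
    re.compile(r"\bu([AVF])\b"),  # "µA"/"µV"/"µF" flattened to "uA"/"uV"/"uF"
    re.compile(r"[|]{2,}"),       # parallel-resistor notation garbled into pipes
]

def needs_review(transcription: str) -> bool:
    """Flag a transcription if any recurrent error pattern fires."""
    return any(p.search(transcription) for p in ERROR_PATTERNS)

def triage(transcriptions: list[str]) -> tuple[list[int], list[int]]:
    """Split solution indices into (human-review queue, auto-grade queue)."""
    review, auto = [], []
    for idx, text in enumerate(transcriptions):
        (review if needs_review(text) else auto).append(idx)
    return review, auto

review_queue, auto_queue = triage(["R_eq = 4.7 mQ", "V_out = 3.3 V"])
print(len(review_queue), "flagged for review,", len(auto_queue), "auto-graded")
```

Under this kind of triage, only the small flagged fraction of solutions requires human correction before the grading model runs, matching the low review budget reported above.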
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
handwritten STEM solutions
educational evaluation
auto-grading reliability
domain-specific benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal large language models
handwritten STEM solutions
recognition fidelity
auto-grading robustness
error pattern correction