SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models

πŸ“… 2026-01-11
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge that multimodal large language models struggle to accurately diagnose structural, semantic, and cognitive errors in student-drawn STEM diagrams. To this end, the paper introduces the first diagnostic evaluation benchmark designed specifically for grading hand-drawn diagrams, spanning four domains of student responses: geometry, physics, charts, and flowcharts. The authors construct a diverse dataset of 1,015 annotated samples with structured labels and fine-grained error categories to systematically assess model performance on symbolic and noisy visual inputs. Experiments show that state-of-the-art models fall significantly short of human evaluators on this task, demonstrating the benchmark's effectiveness at exposing critical gaps in current multimodal reasoning capabilities.

πŸ“ Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at https://github.com/yuhangsu82/SketchJudge.
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models, hand-drawn diagrams, visual grading, diagnostic reasoning, STEM education
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models, Visual Grading, Hand-drawn Diagrams, Diagnostic Benchmark, STEM Education
πŸ”Ž Similar Papers
No similar papers found.
Yuhang Su, School of Artificial Intelligence, Beijing Normal University, Beijing, China
Mei Wang, Beijing Normal University (face recognition, fairness in AI, domain adaptation)
Yaoyao Zhong, Beijing Normal University (computer vision, multimedia, adversarial robustness)
Guozhang Li, School of Artificial Intelligence, Beijing Normal University, Beijing, China
Shixing Li, School of Artificial Intelligence, Beijing Normal University, Beijing, China
Yihan Feng, School of Artificial Intelligence, Beijing Normal University, Beijing, China
Hua Huang, Beijing Normal University (visual computing, computer graphics, computational photography)