🤖 AI Summary
This study addresses the time-intensive and subjective nature of scoring handwritten STEM responses, particularly those that combine symbols, calculations, and diagrams, where partial credit increases scorer variability. Focusing on undergraduate physics, the authors run two rounds of scoring with GPT-4o to evaluate how skill-based rubrics of differing granularity, prompt formats, and temperature settings influence scoring reliability, benchmarking against scores from four instructors on 20 authentic handwritten exam responses. Results indicate that human-AI agreement on total scores is comparable to human inter-rater reliability, with the strongest alignment for high- and low-scoring responses and reduced consistency for mid-level responses that involve partial or ambiguous reasoning. A fine-grained, checklist-based rubric improves consistency relative to holistic scoring, and criterion-level alignment is stronger for clearly defined conceptual skills than for extended procedural judgments. The findings point to rubric structure as the primary determinant of reliable AI scoring, followed by prompt format, while temperature has relatively limited impact, offering transferable design recommendations for LLM-assisted scoring in STEM education.
📝 Abstract
Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students' reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.
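To make the two ideas in the abstract concrete, the following is a minimal Python sketch of (a) a checklist-style rubric in which a total score is the sum of independently judged yes/no items, and (b) a human-AI agreement summary grouped by performance band. Everything in it is an illustrative assumption: the rubric items, their point values, the one-third and two-thirds band thresholds, and the use of mean absolute difference as the agreement measure are hypothetical and are not taken from the study.

```python
from statistics import mean

# Hypothetical checklist rubric (NOT the study's rubric): each item is a
# yes/no observation about the response, worth a fixed number of points.
RUBRIC = {
    "identifies_relevant_principle": 2,
    "sets_up_equation": 2,
    "carries_out_algebra": 3,
    "states_answer_with_units": 1,
}


def checklist_score(checks):
    """Sum the points of every rubric item marked as satisfied."""
    return sum(points for item, points in RUBRIC.items() if checks.get(item, False))


def band(score, max_score):
    """Assign a performance band; the 1/3 and 2/3 cutoffs are assumptions."""
    fraction = score / max_score
    if fraction < 1 / 3:
        return "low"
    if fraction < 2 / 3:
        return "mid"
    return "high"


def agreement_by_band(human_scores, ai_scores, max_score):
    """Mean absolute human-AI score difference, grouped by the human score's band."""
    diffs = {}
    for human, ai in zip(human_scores, ai_scores):
        diffs.setdefault(band(human, max_score), []).append(abs(human - ai))
    return {name: mean(values) for name, values in diffs.items()}
```

Under these assumptions, a scorer (human or LLM) only has to make one narrow judgment per checklist item, and a per-band summary would show whether disagreement concentrates in mid-range responses, which is the pattern the abstract describes.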