🤖 AI Summary
This study addresses the challenge of enhancing cross-domain generalization and feedback quality for large language models (LLMs) in automated scoring of formative assessments across multidisciplinary domains—science, computing, and engineering. We propose a novel framework integrating Evidence-Centered Design (ECD), human-AI collaborative closed-loop prompt engineering, and teacher–student feedback-driven active learning to ensure interpretability, robustness, and continuous improvement of scoring systems. Crucially, we unify these three components to enable cross-course transferability and iterative refinement. Evaluated in authentic educational settings using GPT-4 with chain-of-thought reasoning, our approach achieves up to a 24.5% improvement in scoring accuracy over a baseline method. Moreover, teacher–student collaborative feedback significantly enhances inter-rater consistency and the interpretability of generated feedback. The framework provides a scalable, methodology-driven foundation for AI-enhanced educational assessment.
📝 Abstract
Large language models (LLMs) have created new opportunities to assist teachers and support student learning. Methods such as chain-of-thought (CoT) prompting enable LLMs to grade formative assessments in science, providing scores and relevant feedback to students. However, the extent to which these methods generalize across curricula in multiple domains (such as science, computing, and engineering) remains largely untested. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates teacher and student feedback to iteratively refine assessment questions, grading rubrics, and LLM prompts for automated grading. Our findings demonstrate that CoTAL improves GPT-4's scoring performance, achieving gains of up to 24.5% over a non-prompt-engineered baseline. Both teachers and students view CoTAL as effective in scoring and explaining student responses, each providing valuable refinements to enhance grading accuracy and explanation quality.
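As a rough illustration of the chain-of-thought scoring step described above, a grading prompt can be assembled from a question, rubric, and student response before being sent to an LLM such as GPT-4. This is a minimal sketch under stated assumptions: the question, rubric criteria, scoring scale, and function name are hypothetical placeholders for exposition, not the paper's actual assessment instruments or prompts.

```python
def build_cot_grading_prompt(question: str, rubric: list[str], response: str) -> str:
    """Assemble a chain-of-thought grading prompt: the model is asked to
    reason through each rubric criterion before emitting scores and
    student-readable feedback. (Illustrative only — not CoTAL's prompt.)"""
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (
        "You are grading a student's formative-assessment response.\n"
        f"Question: {question}\n"
        "Rubric criteria:\n"
        f"{criteria}\n"
        f"Student response: {response}\n"
        "Think step by step: for each criterion, quote the evidence in the "
        "response (or note its absence), then assign a score of 0-2 per "
        "criterion with a short explanation the student can read."
    )

# Hypothetical example inputs for demonstration.
prompt = build_cot_grading_prompt(
    question="Why does a heavier cart need a longer ramp to stop safely?",
    rubric=["Identifies the role of momentum", "Connects ramp length to stopping force"],
    response="A heavier cart has more momentum, so it takes more distance to slow down.",
)
print(prompt)
```

In practice, the human-in-the-loop step would involve teachers iteratively revising the criteria and instructions in a template like this one based on where the model's scores and explanations diverge from their own.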