CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of enhancing cross-domain generalization and feedback quality for large language models (LLMs) in automated scoring of formative assessments across multiple domains (science, computing, and engineering). We propose a framework integrating Evidence-Centered Design (ECD), human-in-the-loop prompt engineering, and active learning driven by teacher and student feedback to ensure interpretability, robustness, and continuous improvement of scoring systems. Crucially, these three components are unified to enable cross-course transferability and iterative refinement. Evaluated in authentic educational settings using GPT-4 with chain-of-thought reasoning, the approach achieves up to a 24.5% improvement in scoring accuracy over a non-prompt-engineered baseline. Moreover, teacher and student feedback significantly enhances inter-rater consistency and the interpretability of the generated feedback. The framework provides a scalable, methodology-driven foundation for AI-enhanced educational assessment.
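
To make the chain-of-thought scoring step concrete, here is a minimal sketch of what a rubric-grounded CoT grading call to GPT-4 might look like. It assumes the OpenAI Python client; the rubric text, question, and prompt wording are illustrative placeholders, not the paper's actual materials.

```python
# Minimal sketch of rubric-grounded chain-of-thought scoring with GPT-4.
# Assumes the OpenAI Python client (openai>=1.0); the rubric, question,
# and prompt wording are illustrative placeholders, not the paper's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """\
Criterion 1 (0-2 pts): identifies the relevant variable or mechanism.
Criterion 2 (0-2 pts): explains the expected effect, citing evidence.
"""

def score_response(question: str, student_response: str) -> str:
    """Ask the model to reason criterion by criterion, then emit a
    total score and a short student-facing explanation."""
    prompt = (
        f"Question: {question}\n"
        f"Rubric:\n{RUBRIC}\n"
        f"Student response: {student_response}\n\n"
        "Think step by step: check the response against each rubric "
        "criterion, quoting evidence from the response. Then output a "
        "line 'Score: <total>' followed by a one-sentence explanation."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic scoring for consistency
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

if __name__ == "__main__":
    print(score_response(
        "What happens to a rover's speed on a steeper ramp, and why?",
        "It speeds up because gravity pulls harder along the slope.",
    ))
```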

📝 Abstract
Large language models (LLMs) have created new opportunities to assist teachers and support student learning. Methods such as chain-of-thought (CoT) prompting enable LLMs to grade formative assessments in science, providing scores and relevant feedback to students. However, the extent to which these methods generalize across curricula in multiple domains (such as science, computing, and engineering) remains largely untested. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates teacher and student feedback to iteratively refine assessment questions, grading rubrics, and LLM prompts for automated grading. Our findings demonstrate that CoTAL improves GPT-4's scoring performance, achieving gains of up to 24.5% over a non-prompt-engineered baseline. Both teachers and students view CoTAL as effective in scoring and explaining student responses, each providing valuable refinements to enhance grading accuracy and explanation quality.
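
The abstract's third component, iterative refinement from teacher and student feedback, can be read as a lightweight active-learning loop: responses where the LLM's score disagrees with a teacher's score are flagged for review and folded back into the prompt as worked examples. The sketch below is one plausible reading of that loop, not the paper's exact procedure; `llm_score` and `teacher_score` are hypothetical stand-ins for the real GPT-4 scoring pipeline and human grading.

```python
# One plausible reading of CoTAL's feedback-driven refinement loop.
# `llm_score` and `teacher_score` are hypothetical stand-ins for the
# real GPT-4 scoring pipeline and human grading.
from dataclasses import dataclass, field

@dataclass
class ScoringPrompt:
    instructions: str                             # rubric + CoT grading directions
    examples: list = field(default_factory=list)  # few-shot worked examples

def refine_once(prompt: ScoringPrompt, responses, llm_score, teacher_score):
    """One round: score all responses, flag LLM/teacher disagreements,
    and add the teacher-corrected cases to the prompt as exemplars."""
    flagged = [r for r in responses
               if llm_score(prompt, r) != teacher_score(r)]
    for r in flagged:
        prompt.examples.append((r, teacher_score(r)))
    return prompt, flagged  # rerun until disagreements stop shrinking
```
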
Problem

Research questions and friction points this paper is trying to address.

Generalizing LLM-based grading across multiple curricular domains
Automating formative assessment scoring with human feedback
Improving grading accuracy and explanation quality iteratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-loop prompt engineering for automated scoring
Evidence-Centered Design for curriculum-aligned assessments (see the rubric sketch below)
Active Learning to refine grading rubrics iteratively
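
For context on the ECD piece referenced above: Evidence-Centered Design frames each assessment item as a claim about what the student knows, the observable evidence that would support that claim, and a task that elicits the evidence, with a rubric mapping evidence to score levels. The representation below is an illustrative sketch of that framing; the field names and example content are hypothetical, not the paper's schema.

```python
# Illustrative ECD-aligned rubric item. The claim/evidence/task framing is
# standard ECD; the concrete fields and example content are hypothetical.
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    claim: str              # what we want to infer about the student
    evidence: str           # observable response features supporting the claim
    task: str               # the assessment question eliciting that evidence
    levels: dict[int, str]  # score -> level descriptor

criterion = RubricCriterion(
    claim="Student can relate ramp steepness to rover speed",
    evidence="Response links slope/gravity to a change in speed",
    task="What happens to a rover's speed on a steeper ramp, and why?",
    levels={0: "No causal link", 1: "Partial link", 2: "Complete causal explanation"},
)
```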