🤖 AI Summary
To address the low efficiency, limited interpretability, and weak actionability of manual grading for programming assignments, this paper proposes the first Chain-of-Thought (CoT)-based multidimensional automated scoring framework. The method models functional correctness, code quality, and algorithmic efficiency as semantically interrelated dimensions, enabling end-to-end, context-aware, and transparent evaluation through structured CoT reasoning. Unlike conventional prompting approaches, the framework explicitly encodes the logical dependencies among scoring dimensions, a novel contribution to programming assessment, and incorporates an expert-annotation validation protocol. Experiments on 30 Python programming tasks spanning easy, intermediate, and advanced difficulty levels show that the framework achieves high agreement with human experts (Cohen's κ = 0.89), substantially outperforming regular-prompting baselines. It advances the state of the art in accuracy, interpretability, and fairness, three critical requirements for equitable and trustworthy automated assessment.
📝 Abstract
Grading programming assignments is a labor-intensive and time-consuming process that demands careful evaluation across multiple dimensions of the code. Automated grading systems can ease this burden and reduce the workload on educators, but traditional systems often focus solely on correctness, failing to provide interpretable evaluations or actionable feedback for students. This study introduces StepGrade, which explores Chain-of-Thought (CoT) prompting with Large Language Models (LLMs) as a solution to these challenges. Unlike regular prompting, which yields limited, surface-level outputs, CoT prompting lets the model reason step by step through the interconnected grading criteria, namely functionality, code quality, and algorithmic efficiency, ensuring a more comprehensive and transparent evaluation. This interconnectedness necessitates CoT so that each criterion is addressed systematically while accounting for its influence on the others. To empirically validate the efficacy of StepGrade, we conducted a case study of 30 Python programming assignments across three difficulty levels (easy, intermediate, and advanced), validating the approach against expert human evaluations for consistency, accuracy, and fairness. Results show that CoT prompting significantly outperforms regular prompting in both grading quality and interpretability. By reducing the time and effort required for manual grading, this research demonstrates the potential of GPT-4 with CoT prompting to revolutionize programming education through scalable and pedagogically effective automated grading systems.
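The abstract describes CoT prompting that walks through functionality, code quality, and algorithmic efficiency in order, with later criteria conditioned on earlier findings. A minimal sketch of how such a prompt might be assembled is shown below; the three criterion names come from the paper, but the prompt wording, the 0-10 scale, and the `build_cot_prompt` helper are illustrative assumptions, not StepGrade's actual prompt.

```python
# Hypothetical sketch of a StepGrade-style CoT grading prompt.
# Criterion names follow the paper; all wording and the scoring
# scale are assumptions for illustration only.

CRITERIA = [
    ("Functionality",
     "Does the code satisfy the task requirements and handle edge cases?"),
    ("Code Quality",
     "Given the functional findings above, assess readability, naming, and structure."),
    ("Algorithmic Efficiency",
     "Given the approach identified above, assess time and space complexity."),
]

def build_cot_prompt(task: str, code: str) -> str:
    """Assemble a step-by-step grading prompt in which each later
    criterion can condition on the reasoning of earlier ones."""
    steps = "\n".join(
        f"Step {i}: {name}. {question} "
        "Reason step by step, then write 'Score: <0-10>'."
        for i, (name, question) in enumerate(CRITERIA, start=1)
    )
    return (
        "You are grading a Python programming assignment.\n"
        f"Task description:\n{task}\n\n"
        f"Student submission:\n{code}\n\n"
        "Grade in the following order, since each criterion builds on "
        f"the previous one:\n{steps}\n"
        "Finally, summarize all three scores and give actionable feedback."
    )

prompt = build_cot_prompt(
    "Sum a list of integers.",
    "def total(xs):\n    return sum(xs)",
)
print(prompt)
```

The ordering of the steps is the point: because the criteria are interrelated, a single structured prompt lets the efficiency judgment reference the algorithm identified while checking functionality, rather than scoring each dimension in isolation.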