A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Automated scoring of open-ended short-answer questions (SAGs) faces challenges including high human grading costs, low inter-rater consistency, and the limited generalizability of existing automatic short-answer grading (ASAG) methods, which often require question-specific customization. To address these, the authors propose GradeOpt, a unified multi-agent ASAG framework built on collaborating LLM agents: a Grader, a Reflector, and a Refiner. Through self-reflection on its grading errors, GradeOpt automatically and iteratively refines the original grading guidelines, reducing reliance on per-question engineering and improving alignment with human scoring behavior. On a challenging task of grading pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt outperforms representative baselines in both grading accuracy and human-machine agreement, and ablation studies confirm the contribution of each agent module to overall performance.
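The reflect-and-refine loop described above can be sketched as a simple iterative procedure. The sketch below is a hypothetical, non-LLM stand-in: `grade`, `reflect`, and `refine` are placeholder functions (keyword matching instead of actual LLM agents) chosen only to make the control flow concrete; the paper's agents are prompt-driven LLMs.

```python
# Minimal sketch of GradeOpt-style guideline optimization.
# All three agent functions are hypothetical stubs, not the paper's prompts.

def grade(guidelines, answer):
    # Grader stub: score 1 if any guideline keyword appears in the answer.
    return int(any(k in answer.lower() for k in guidelines))

def reflect(errors):
    # Reflector stub: extract candidate keywords from misgraded answers
    # whose human label was positive.
    return {w for _, ans, label in errors if label == 1
            for w in ans.lower().split()}

def refine(guidelines, suggestions):
    # Refiner stub: merge the reflector's suggestions into the guidelines.
    return guidelines | suggestions

def optimize(guidelines, labeled, iters=3):
    """Iteratively refine guidelines until they reproduce human labels."""
    for _ in range(iters):
        errors = [(g, ans, y) for ans, y in labeled
                  if (g := grade(guidelines, ans)) != y]
        if not errors:  # guidelines already align with human grading
            break
        guidelines = refine(guidelines, reflect(errors))
    return guidelines

labeled = [("photosynthesis uses sunlight", 1), ("i don't know", 0)]
guidelines = optimize({"chlorophyll"}, labeled)
```

After one iteration the stub guidelines absorb keywords from the misgraded positive answer and both examples are scored consistently with the human labels; in the actual framework, the analogous step rewrites natural-language grading guidelines rather than a keyword set.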

📝 Abstract
Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA). However, SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. With recent advancements in natural language processing (NLP), automatic short-answer grading (ASAG) offers a promising solution to these challenges. Despite this, current ASAG algorithms are often limited in generalizability and tend to be tailored to specific questions. In this paper, we propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs. More importantly, GradeOpt incorporates two additional LLM-based agents - the reflector and the refiner - into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. Through experiments on a challenging ASAG task, namely the grading of pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt demonstrates superior performance in grading accuracy and behavior alignment with human graders compared to representative baselines. Finally, comprehensive ablation studies confirm the effectiveness of the individual components designed in GradeOpt.
Problem

Research questions and friction points this paper is trying to address.

Automating grading of open-ended short-answer questions (SAGs) to reduce workload.
Improving grading consistency and generalizability in learning analytics (LA) contexts.
Optimizing grading guidelines using LLM-powered multi-agent systems for accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-powered multi-agent grading framework
Self-optimizing guidelines via reflection agents
Grading of open-ended answers aligned with human graders