From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education

📅 2025-02-19

📈 Citations: 0

✨ Influential: 0

career value

153K/year

🤖 AI Summary

Current large language models (e.g., GPT-4) prioritize answer correctness in mathematical reasoning but lack the capability to diagnose underlying causes of student errors or generate interpretable, pedagogically grounded feedback—limiting their utility in personalized education. Method: We propose a paradigm shift from “binary correctness assessment” to “causal error understanding,” introducing MathCCS—the first multimodal benchmark for mathematical error cause diagnosis—alongside a history-trajectory-driven sequential analysis framework and a multi-agent architecture integrating temporal reasoning agents with multimodal large language model (MLLM) agents. Our approach synergizes MLLMs (Qwen2-VL, LLaVA-OV, GPT-4o), time-series modeling, and an expert-annotated error taxonomy. Contribution/Results: Experiments show 68.3% accuracy in error cause classification and customized feedback rated 7.9/10 (vs. human expert mean of 8.2), significantly outperforming baselines. This establishes a novel, interpretable, and personalized diagnostic and feedback paradigm for AI-enhanced education.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs), such as GPT-4, have demonstrated impressive mathematical reasoning capabilities, achieving near-perfect performance on benchmarks like GSM8K. However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation. Current models fail to provide meaningful insights into the causes of student mistakes, limiting their utility in educational contexts. To address these challenges, we present three key contributions. First, we introduce extbf{MathCCS} (Mathematical Classification and Constructive Suggestions), a multi-modal benchmark designed for systematic error analysis and tailored feedback. MathCCS includes real-world problems, expert-annotated error categories, and longitudinal student data. Evaluations of state-of-the-art models, including extit{Qwen2-VL}, extit{LLaVA-OV}, extit{Claude-3.5-Sonnet} and extit{GPT-4o}, reveal that none achieved classification accuracy above 30% or generated high-quality suggestions (average scores below 4/10), highlighting a significant gap from human-level performance. Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision. Finally, we propose a multi-agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real-time refinement, enhancing error classification and feedback generation. Together, these contributions provide a robust platform for advancing personalized education, bridging the gap between current AI capabilities and the demands of real-world teaching.

Problem

Research questions and friction points this paper is trying to address.

Enhance AI in personalized error diagnosis education

Develop systematic error analysis and feedback benchmark

Improve error classification and feedback generation accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal benchmark MathCCS

Sequential error analysis framework

Multi-agent collaborative framework

🔎 Similar Papers

No similar papers found.