🤖 AI Summary
Current large language models (e.g., GPT-4) prioritize answer correctness in mathematical reasoning but lack the capability to diagnose underlying causes of student errors or generate interpretable, pedagogically grounded feedback—limiting their utility in personalized education.
Method: We propose a paradigm shift from “binary correctness assessment” to “causal error understanding,” introducing MathCCS—the first multimodal benchmark for mathematical error cause diagnosis—alongside a history-trajectory-driven sequential analysis framework and a multi-agent architecture integrating temporal reasoning agents with multimodal large language model (MLLM) agents. Our approach synergizes MLLMs (Qwen2-VL, LLaVA-OV, GPT-4o), time-series modeling, and an expert-annotated error taxonomy.
Contribution/Results: Experiments show 68.3% accuracy in error cause classification and customized feedback rated 7.9/10 (vs. human expert mean of 8.2), significantly outperforming baselines. This establishes a novel, interpretable, and personalized diagnostic and feedback paradigm for AI-enhanced education.
📝 Abstract
Large Language Models (LLMs), such as GPT-4, have demonstrated impressive mathematical reasoning capabilities, achieving near-perfect performance on benchmarks like GSM8K. However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation. Current models fail to provide meaningful insights into the causes of student mistakes, limiting their utility in educational contexts. To address these challenges, we present three key contributions. First, we introduce extbf{MathCCS} (Mathematical Classification and Constructive Suggestions), a multi-modal benchmark designed for systematic error analysis and tailored feedback. MathCCS includes real-world problems, expert-annotated error categories, and longitudinal student data. Evaluations of state-of-the-art models, including extit{Qwen2-VL}, extit{LLaVA-OV}, extit{Claude-3.5-Sonnet} and extit{GPT-4o}, reveal that none achieved classification accuracy above 30% or generated high-quality suggestions (average scores below 4/10), highlighting a significant gap from human-level performance. Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision. Finally, we propose a multi-agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real-time refinement, enhancing error classification and feedback generation. Together, these contributions provide a robust platform for advancing personalized education, bridging the gap between current AI capabilities and the demands of real-world teaching.