🤖 AI Summary
This study addresses critical shortcomings in personalized math word problems generated by large language models (LLMs), including implausible contexts, poor readability, insufficient realism, and mathematical inaccuracies. To mitigate these issues, the authors propose a multi-agent collaborative framework that models problem generation as an iterative "generate–validate–revise" process. Four specialized agents independently evaluate solvability, realism, readability, and authenticity, guiding targeted revisions accordingly. This work represents the first application of a multi-agent mechanism to automated validation and refinement in LLM-based math problem generation, explicitly distinguishing and addressing distinct error dimensions. Experiments on 600 generated problems demonstrate that a single iteration significantly reduces errors in realism and authenticity. Human evaluations confirm the reliability of the validator agents in assessing realism, while indicating room for improvement in authenticity judgment.
📝 Abstract
Students benefit from math problems contextualized to their interests. Large language models (LLMs) offer promise for efficient personalization at scale. However, LLM-generated personalized problems often exhibit issues such as unrealistic quantities and contexts, poor readability, limited authenticity with respect to students' experiences, and occasional mathematical inconsistencies. To alleviate these issues, we propose a multi-agent framework that formalizes personalization as an iterative generate–validate–revise process; we use four specialized validator agents targeting the criteria of solvability, realism, readability, and authenticity, respectively. We evaluate our framework on 600 problems drawn from a popular online mathematics homework platform, ASSISTments, personalizing each problem to a fixed set of 20 student interest topics. We compare three refinement strategies that differ in how validation feedback is coordinated into revisions. Results show that authenticity and realism are the most frequent failure modes in initial LLM-personalized problems, but that a single refinement iteration substantially reduces these failures. We further find that different refinement strategies have different strengths on different criteria. We also assess validator reliability via human evaluation. Results show that reliability is highest on realism and lowest on authenticity, highlighting the need for better evaluation protocols that consider teachers' and students' personal characteristics.
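The generate–validate–revise loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the validator interface, the `refine` function, and the stand-in revision call are all hypothetical names assumed for this example.

```python
# Hypothetical sketch of an iterative generate-validate-revise loop
# with four criterion-specific validators. All names and signatures
# here are illustrative assumptions, not the paper's actual code.
from typing import Callable, Dict, Tuple

# The four criteria named in the paper.
CRITERIA = ["solvability", "realism", "readability", "authenticity"]

# A validator takes a problem text and returns (passed, feedback note).
Validator = Callable[[str], Tuple[bool, str]]


def refine(problem: str,
           validators: Dict[str, Validator],
           revise: Callable[[str, Dict[str, str]], str],
           max_iters: int = 1) -> str:
    """Run each validator on the problem; if any fail, pass their
    feedback to a reviser and repeat up to max_iters times."""
    for _ in range(max_iters):
        feedback: Dict[str, str] = {}
        for criterion in CRITERIA:
            ok, note = validators[criterion](problem)
            if not ok:
                feedback[criterion] = note  # collect only failures
        if not feedback:
            return problem  # all four validators passed
        problem = revise(problem, feedback)  # targeted revision
    return problem
```

In the paper each validator and the reviser would be an LLM agent; the three refinement strategies it compares would correspond to different ways of coordinating the collected `feedback` into a revision.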