A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses critical shortcomings in personalized math word problems generated by large language models (LLMs), including implausible contexts, poor readability, insufficient realism, and mathematical inaccuracies. To mitigate these issues, the authors propose a multi-agent collaborative framework that models problem generation as an iterative “generate–verify–revise” process. Four specialized agents independently evaluate solvability, realism, readability, and authenticity, guiding targeted revisions accordingly. This work represents the first application of a multi-agent mechanism to automated verification and refinement in LLM-based math problem generation, explicitly distinguishing and addressing distinct error dimensions. Experiments on 600 generated problems demonstrate that a single iteration significantly reduces errors in realism and authenticity. Human evaluations confirm the reliability of the verifier agents in assessing realism, while indicating room for improvement in authenticity judgment.
📝 Abstract
Students benefit from math problems contextualized to their interests. Large language models (LLMs) offer promise for efficient personalization at scale. However, LLM-generated personalized problems often exhibit issues such as unrealistic quantities and contexts, poor readability, limited authenticity with respect to students' experiences, and occasional mathematical inconsistencies. To alleviate these problems, we propose a multi-agent framework that formalizes personalization as an iterative generate--validate--revise process; we use four specialized validator agents targeting the criteria of solvability, realism, readability, and authenticity, respectively. We evaluate our framework on 600 problems drawn from a popular online mathematics homework platform, ASSISTments, personalizing each problem to a fixed set of 20 student interest topics. We compare three refinement strategies that differ in how validation feedback is coordinated into revisions. Results show that authenticity and realism are the most frequent failure modes in initial LLM-personalized problems, but that a single refinement iteration substantially reduces these failures. We further find that different refinement strategies have different strengths on different criteria. We also assess validator reliability via human evaluation. Results show that reliability is highest on realism and lowest on authenticity, highlighting the need for better evaluation protocols that consider teachers' and students' personal characteristics.
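The generate–validate–revise loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the validator names mirror the four criteria (solvability, realism, readability, authenticity), but the callable signatures, the feedback format, and the `refine` helper are all assumptions for illustration; in the paper each agent would be an LLM call.

```python
from typing import Callable

# A validator inspects a problem and returns (passed, feedback note).
Validator = Callable[[str], tuple[bool, str]]

def refine(problem: str,
           validators: dict[str, Validator],
           revise: Callable[[str, dict[str, str]], str],
           max_iters: int = 1) -> str:
    """Run up to max_iters generate–validate–revise rounds.

    Each round collects feedback from every failing validator and
    passes it to the reviser; the loop stops early once all
    criteria are satisfied.
    """
    for _ in range(max_iters):
        feedback = {}
        for criterion, check in validators.items():
            passed, note = check(problem)
            if not passed:
                feedback[criterion] = note
        if not feedback:  # all criteria satisfied
            break
        problem = revise(problem, feedback)
    return problem

# Toy stand-ins for the realism agent and the reviser (not from the paper):
toy_validators = {
    "realism": lambda p: ("1000000" not in p, "quantity is implausibly large"),
}
toy_revise = lambda p, fb: p.replace("1000000", "12")

print(refine("Ada buys 1000000 apples.", toy_validators, toy_revise))
```

The paper's single-iteration setting corresponds to `max_iters=1`; the three refinement strategies it compares would differ in how `revise` consumes the per-criterion feedback (e.g., one revision per failing criterion versus one revision conditioned on all feedback at once).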
Problem

Research questions and friction points this paper is trying to address.

personalized math problems
large language models
realism
authenticity
mathematical consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent framework
personalized math problems
LLM validation
iterative refinement
authenticity assessment
👥 Authors
Fareya Ikram, University of Massachusetts Amherst, Amherst, MA, USA
Nischal Ashok Kumar, University of Massachusetts Amherst, Amherst, MA, USA
Junyang Lu, University of Massachusetts Amherst, Amherst, MA, USA
Hunter McNichols, University of Massachusetts Amherst, Amherst, MA, USA
Candace Walkington, Southern Methodist University (Mathematics Education · Learning Sciences)
Neil Heffernan, Worcester Polytechnic Institute, Worcester, MA, USA
Andrew S. Lan, University of Massachusetts Amherst (AI in Education · Natural Language Processing · Learning Analytics · Educational Data Mining)