🤖 AI Summary
This study addresses critical shortcomings in personalized math word problems generated by large language models (LLMs), including implausible contexts, poor readability, insufficient realism, and mathematical inaccuracies. To mitigate these issues, the authors propose a multi-agent collaborative framework that models problem generation as an iterative "generate–validate–revise" process. Four specialized agents independently evaluate solvability, realism, readability, and authenticity, guiding targeted revisions accordingly. This work represents the first application of a multi-agent mechanism to automated validation and refinement in LLM-based math problem generation, explicitly distinguishing and addressing distinct error dimensions. Experiments on 600 generated problems demonstrate that a single iteration significantly reduces errors in realism and authenticity. Human evaluations confirm the reliability of the validator agents in assessing realism, while indicating room for improvement in authenticity judgment.
📝 Abstract
Students benefit from math problems contextualized to their interests. Large language models (LLMs) offer promise for efficient personalization at scale. However, LLM-generated personalized problems often exhibit issues such as unrealistic quantities and contexts, poor readability, limited authenticity with respect to students' experiences, and occasional mathematical inconsistencies. To alleviate these issues, we propose a multi-agent framework that formalizes personalization as an iterative generate–validate–revise process; we use four specialized validator agents targeting the criteria of solvability, realism, readability, and authenticity, respectively. We evaluate our framework on 600 problems drawn from a popular online mathematics homework platform, ASSISTments, personalizing each problem to a fixed set of 20 student interest topics. We compare three refinement strategies that differ in how validation feedback is coordinated into revisions. Results show that authenticity and realism are the most frequent failure modes in initial LLM-personalized problems, but that a single refinement iteration substantially reduces these failures. We further find that different refinement strategies have different strengths on different criteria. We also assess validator reliability via human evaluation. Results show that reliability is highest on realism and lowest on authenticity, highlighting the need for better evaluation protocols that consider teachers' and students' personal characteristics.
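The generate–validate–revise loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the validator interface, the `refine` function, and the stand-in revision call are all hypothetical names assumed for this example.

```python
# Hypothetical sketch of an iterative generate-validate-revise loop
# with four criterion-specific validators. All names and signatures
# here are illustrative assumptions, not the paper's actual code.
from typing import Callable, Dict, Tuple

# The four criteria named in the paper.
CRITERIA = ["solvability", "realism", "readability", "authenticity"]

# A validator takes a problem text and returns (passed, feedback note).
Validator = Callable[[str], Tuple[bool, str]]


def refine(problem: str,
           validators: Dict[str, Validator],
           revise: Callable[[str, Dict[str, str]], str],
           max_iters: int = 1) -> str:
    """Run each validator on the problem; if any fail, pass their
    feedback to a reviser and repeat up to max_iters times."""
    for _ in range(max_iters):
        feedback: Dict[str, str] = {}
        for criterion in CRITERIA:
            ok, note = validators[criterion](problem)
            if not ok:
                feedback[criterion] = note  # collect only failures
        if not feedback:
            return problem  # all four validators passed
        problem = revise(problem, feedback)  # targeted revision
    return problem
```

In the paper each validator and the reviser would be an LLM agent; the three refinement strategies it compares would correspond to different ways of coordinating the collected `feedback` into a revision.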