🤖 AI Summary
Automated generation of pedagogically effective visual aids for mathematical word problems (MWPs) remains underexplored in mathematics education.
Method: We introduce Math2Visual, the first visualization-generation framework tailored for math education. It is grounded in teacher interviews, which inform an education-oriented visual language and design space along with interpretable, measurable criteria for visual representations. Using Math2Visual, we construct a benchmark dataset of 1,903 image-text pairs and fine-tune text-to-image (TTI) models (e.g., Stable Diffusion), complemented by a multidimensional human evaluation protocol.
Contribution/Results: Fine-tuned models represent mathematical relationships significantly more accurately than off-the-shelf TTI models. Our analysis reveals, for the first time, systematic failures of existing TTI models in capturing quantitative relations and operational logic, both critical dimensions for MWP understanding. We establish a novel evaluation standard centered on instructional effectiveness, providing both methodological foundations and empirical evidence for explainable, educationally grounded visual generation in AI-enhanced learning.
📝 Abstract
Visuals are valuable tools for teaching math word problems (MWPs), helping young learners translate textual descriptions into mathematical expressions before solving them. However, creating such visuals is labor-intensive, and automated methods to support this process are lacking. In this paper, we present Math2Visual, an automatic framework for generating pedagogically meaningful visuals from MWP text descriptions. Math2Visual leverages a pre-defined visual language and a design space, grounded in interviews with math teachers, to illustrate the core mathematical relationships in MWPs. Using Math2Visual, we construct an annotated dataset of 1,903 visuals and evaluate Text-to-Image (TTI) models for their ability to generate visuals that align with our design. We further fine-tune several TTI models on our dataset, demonstrating improvements in educational visual generation. Our work establishes a new benchmark for automated generation of pedagogically meaningful visuals and offers insights into key challenges in producing multimodal educational content, such as the misrepresentation of mathematical relationships and the omission of essential visual elements.