DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language model (LLM)-generated educational geometry diagrams lacks efficient, scalable automated assessment methods. Method: This paper proposes an automated evaluation framework that uses LaTeX TikZ code as an intermediate representation (IR). Generated diagrams are standardized into structured TikZ code, and semantic parsing is combined with LLM-as-a-judge techniques to perform fine-grained assessment of geometric relationships, label accuracy, and visual-structural consistency. Contribution/Results: The framework achieves high agreement with human raters (Cohen's κ = 0.89) and significantly outperforms existing evaluation baselines. Notably, it enables lightweight models to match the performance of large models at less than one-tenth the inference cost, establishing a practical, low-cost, high-fidelity evaluation paradigm for math visualization in education and supporting trustworthy, deployable educational tools.
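To make the IR idea concrete, here is a minimal sketch of how evaluating via a TikZ intermediate representation might work, assuming diagrams built only from simple `\draw (x,y) -- (x,y);` path commands. The parsing rules, segment-set IR, and F1-style scoring function are illustrative assumptions, not the paper's actual pipeline:

```python
import re

# Matches a TikZ coordinate like (0,0) or (1.5,-2), capturing x and y.
COORD = r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"

def tikz_to_ir(tikz: str) -> set:
    """Parse simple `\\draw (x,y) -- (x,y);` commands into a set of
    order-independent line segments (the structured IR)."""
    segments = set()
    for cmd in re.findall(r"\\draw[^;]*;", tikz):
        pts = [(float(x), float(y)) for x, y in re.findall(COORD, cmd)]
        for a, b in zip(pts, pts[1:]):
            # Sort endpoints so (A--B) and (B--A) map to the same segment.
            segments.add(tuple(sorted((a, b))))
    return segments

def segment_f1(candidate: str, reference: str) -> float:
    """Score a generated diagram against a reference by comparing their
    IRs symbolically instead of comparing rendered pixels."""
    cand, ref = tikz_to_ir(candidate), tikz_to_ir(reference)
    if not cand or not ref:
        return 0.0
    tp = len(cand & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(cand), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the comparison happens on structured geometry rather than rendered images, a generated diagram that draws the same triangle with its edges in a different order still scores perfectly, which is the kind of invariance an IR-based evaluator provides.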

📝 Abstract
Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a 10x lower inference cost, which is important for deploying accessible and scalable education technologies.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-generated educational math diagrams automatically and at scale
Overcoming the text-only limitations of math learning tools in domains, such as geometry, that require visualization
Reducing evaluation cost while maintaining agreement with human raters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses intermediate representations of LaTeX TikZ code to evaluate diagrams symbolically
Evaluates geometric diagrams automatically and at scale
Enables smaller models (e.g., GPT-4.1-Mini) to match larger ones (e.g., GPT-5) at roughly 10x lower inference cost