🤖 AI Summary
Existing approaches to educational diagram generation fail to deliver label accuracy and visual appeal at the same time: diffusion models produce visually rich images but garble text labels, code-based methods yield correct labels but visually flat results, and commercial APIs are costly and unreliable. The authors quantify this trade-off across all three paradigms on 400 K-12 diagram prompts, using both automated and human evaluation. To address it, they propose CAGE (Code-Anchored Generative Enhancement), a framework that combines programmatic generation with diffusion models. CAGE first uses a large language model to generate executable code that renders a structurally correct diagram, then applies a ControlNet-conditioned diffusion model to improve visual quality while preserving label fidelity. The work also introduces EduDiagram-2K, a dataset of 2,000 paired programmatic-stylized diagrams, and reports proof-of-concept results along with a research agenda for the multimedia community.
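To make the "code-anchored" first stage concrete, here is a minimal sketch of a programmatic renderer, assuming a simple stacked-box layout. The function name `render_labeled_diagram`, the SVG output format, and the layout are illustrative assumptions, not the code CAGE's LLM actually emits. The point is only that labels written by code appear verbatim in the output, which is what the diffusion stage is then asked to preserve.

```python
from xml.sax.saxutils import escape


def render_labeled_diagram(parts, width=320, row_height=60):
    """Render a vertical stack of labeled boxes as an SVG string.

    `parts` is a list of (label, fill_color) pairs. This layout is a
    hypothetical stand-in for LLM-generated diagram code. Each label is
    written verbatim into a <text> element, so the renderer cannot
    misspell it.
    """
    height = row_height * len(parts)
    rows = []
    for i, (label, fill) in enumerate(parts):
        y = i * row_height
        # Escape quotes too, because the color lands inside an attribute.
        safe_fill = escape(fill, {'"': "&quot;"})
        rows.append(
            f'<rect x="0" y="{y}" width="{width}" height="{row_height}" '
            f'fill="{safe_fill}" stroke="black"/>'
        )
        rows.append(
            f'<text x="{width // 2}" y="{y + row_height // 2}" '
            f'text-anchor="middle" dominant-baseline="middle">'
            f"{escape(label)}</text>"
        )
    body = "\n".join(rows)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">\n{body}\n</svg>'
    )
```

Calling `render_labeled_diagram([("Chloroplast", "#cfe8cf"), ("Mitochondrion", "#f5d6c6")])` returns an SVG document whose text elements contain exactly those two labels. In a pipeline like the one described, an image rasterized from such output would serve as the conditioning input for the diffusion stage.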
📝 Abstract
Educational diagrams (labeled illustrations of biological processes, chemical structures, physical systems, and mathematical concepts) are essential cognitive tools in K-12 instruction. Yet no existing method can generate them both accurately and engagingly. Open-source diffusion models produce visually rich images but catastrophically garble text labels. Code-based generation via LLMs guarantees label correctness but yields visually flat outputs. Closed-source APIs partially bridge this gap but remain unreliable and prohibitively expensive at educational scale. We quantify this accuracy-aesthetics dilemma across all three paradigms on 400 K-12 diagram prompts, measuring both label fidelity and visual quality through complementary automated and human evaluation protocols. To resolve it, we propose CAGE (Code-Anchored Generative Enhancement): an LLM synthesizes executable code producing a structurally correct diagram, then a diffusion model conditioned on the programmatic output via ControlNet refines it into a visually polished graphic while preserving label fidelity. We also introduce EduDiagram-2K, a collection of 2,000 paired programmatic-stylized diagrams enabling this pipeline, and present proof-of-concept results and a research agenda for the multimedia community.
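The abstract does not spell out how label fidelity is measured. One plausible automated proxy, sketched below under that caveat, is the fraction of intended labels that can be recovered from text recognized in the generated image. The function `label_fidelity`, the 0.9 similarity threshold, and the use of `difflib` string similarity are assumptions made for illustration, not the paper's protocol. The `recognized` list stands in for the output of any OCR step.

```python
from difflib import SequenceMatcher


def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def label_fidelity(expected, recognized, threshold=0.9):
    """Fraction of expected labels that have a close match in `recognized`.

    `expected` holds the labels the diagram should contain. `recognized`
    holds strings read back from the image (for example, by OCR). A label
    counts as preserved when its best similarity ratio reaches `threshold`
    (a hypothetical cutoff).
    """
    if not expected:
        return 1.0
    candidates = [_normalize(r) for r in recognized]
    hits = 0
    for label in expected:
        target = _normalize(label)
        best = max(
            (SequenceMatcher(None, target, c).ratio() for c in candidates),
            default=0.0,
        )
        if best >= threshold:
            hits += 1
    return hits / len(expected)
```

Under this proxy, a code-rendered diagram should score at or near 1.0, while a diffusion output with garbled text would score lower. Comparing the score before and after ControlNet refinement would give a rough check on whether the stylization step kept the labels intact.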