Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code

📅 2026-02-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in multimodal geometric reasoning: the misalignment between visual diagrams and symbolic logic, and the scarcity of high-quality, complex data. The authors propose a pipeline for synthesizing multimodal geometry problems from scratch and use it to build the GeoCode dataset. The pipeline combines symbolic seed generation, geometric validation, procedural rendering (e.g., via Matplotlib or GeoGebra code), and multi-stage consistency checks to produce structurally coherent and mathematically correct problems. A key innovation is the use of rendering code as an explicit bridge between symbolic and visual representations: the two can be generated separately, and visual understanding is reframed as a structured prediction task in which the model predicts the plotting code. GeoCode exceeds existing benchmarks in structural complexity and reasoning difficulty, and models trained on it show consistent improvements on multiple geometry benchmarks.
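The stages named above (symbolic seed, grounded instantiation, validation, code-based rendering) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Seed` class, the function names, the fixed coordinates, and the single perpendicularity constraint are all hypothetical choices made for illustration. The renderer emits Matplotlib source as a string instead of drawing directly, so that the plotting code itself can serve as the link between the symbolic description and the image.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Segment = Tuple[str, str]


@dataclass
class Seed:
    """Hypothetical symbolic seed: named points, segments, one constraint."""
    points: List[str]
    segments: List[Segment]
    perpendicular: Tuple[Segment, Segment]  # two segments required to be perpendicular


def instantiate(seed: Seed) -> Dict[str, Point]:
    """Grounding step (assumption): coordinates are fixed here for brevity."""
    return {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 3.0)}


def _vector(coords: Dict[str, Point], seg: Segment) -> Point:
    (x1, y1), (x2, y2) = coords[seg[0]], coords[seg[1]]
    return (x2 - x1, y2 - y1)


def validate(seed: Seed, coords: Dict[str, Point], tol: float = 1e-9) -> bool:
    """Reject instantiations that are incomplete, degenerate, or violate the constraint."""
    if any(name not in coords for name in seed.points):
        return False
    for seg in seed.segments:
        dx, dy = _vector(coords, seg)
        if dx * dx + dy * dy <= tol:  # zero-length segment
            return False
    (ux, uy) = _vector(coords, seed.perpendicular[0])
    (vx, vy) = _vector(coords, seed.perpendicular[1])
    return abs(ux * vx + uy * vy) <= tol  # dot product must vanish


def render_code(seed: Seed, coords: Dict[str, Point]) -> str:
    """Emit Matplotlib source that draws the validated figure."""
    lines = ["import matplotlib.pyplot as plt", "fig, ax = plt.subplots()"]
    for a, b in seed.segments:
        (x1, y1), (x2, y2) = coords[a], coords[b]
        lines.append(f"ax.plot([{x1!r}, {x2!r}], [{y1!r}, {y2!r}], color='black')")
    for name in seed.points:
        x, y = coords[name]
        lines.append(f"ax.annotate({name!r}, ({x!r}, {y!r}))")
    lines += ["ax.set_aspect('equal')", "ax.axis('off')", "fig.savefig('diagram.png')"]
    return "\n".join(lines)


if __name__ == "__main__":
    seed = Seed(
        points=["A", "B", "C"],
        segments=[("A", "B"), ("B", "C"), ("C", "A")],
        perpendicular=(("A", "B"), ("A", "C")),
    )
    coords = instantiate(seed)
    if validate(seed, coords):
        print(render_code(seed, coords))
```

Because the diagram is produced only from code that passed validation, the text, the coordinates, and the image all derive from the same symbolic source, which is the kind of consistency the summary describes.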

📝 Abstract

Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision-language models struggle with complex geometric constructions due to limited training data and weak visual-symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and use it to construct a dataset named GeoCode. The pipeline decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating the effectiveness of both the dataset and the proposed alignment strategy. The code will be available at https://github.com/would1920/GeoCode.
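When a model is trained to predict plotting code, its output can be checked structurally rather than as a raw string. The sketch below is one possible way to do that and is not a metric taken from the paper: it assumes segments are drawn with calls of the form `ax.plot([x1, x2], [y1, y2], ...)`, and the helper names `extract_segments` and `segment_f1` are hypothetical. It parses both programs with Python's `ast` module and computes an F1 score over the sets of drawn segments.

```python
import ast
from typing import Set, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def extract_segments(code: str, ndigits: int = 2) -> Set[Segment]:
    """Collect segments from calls shaped like obj.plot([x1, x2], [y1, y2], ...)."""
    segments: Set[Segment] = set()
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return segments  # unparsable prediction: no segments recovered
    for node in ast.walk(tree):
        is_plot_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "plot"
            and len(node.args) >= 2
        )
        if not is_plot_call:
            continue
        try:
            xs = ast.literal_eval(node.args[0])
            ys = ast.literal_eval(node.args[1])
            if len(xs) != 2 or len(ys) != 2:
                continue
            p = (round(float(xs[0]), ndigits), round(float(ys[0]), ndigits))
            q = (round(float(xs[1]), ndigits), round(float(ys[1]), ndigits))
        except (ValueError, TypeError, SyntaxError):
            continue  # arguments are not plain numeric literals
        # Sort endpoints so that segment direction does not matter.
        segments.add((min(p, q), max(p, q)))
    return segments


def segment_f1(predicted_code: str, reference_code: str) -> float:
    """F1 overlap between the segments drawn by two plotting programs."""
    pred = extract_segments(predicted_code)
    ref = extract_segments(reference_code)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Comparing drawn primitives instead of text makes the score insensitive to variable names, argument order of endpoints, and formatting, which matter little for whether the predicted code reproduces the diagram.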

Problem

Research questions and friction points this paper is trying to address.

multimodal geometry reasoning

vision-language models

visual-symbolic alignment

training data scarcity

geometric constructions

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal geometry reasoning

dataset synthesis

visual-symbolic alignment