GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the susceptibility of large vision-language models to visual hallucinations in geometric reasoning and the scarcity of mathematically precise chain-of-thought data. To this end, we propose the GeoSym Engine, a scalable neuro-symbolic synthesis framework that combines a type-conditional grammar, an analytic symbolic solver (SymGT), and a rendering pipeline to automatically generate GeoSym127K, a verifiable multimodal geometry dataset with 127K questions. We further introduce GeoSym-Bench, an expert-curated evaluation benchmark of 511 samples. After supervised fine-tuning on this synthetic data, Qwen3-VL-8B gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19%) on WeMath, outperforming closed-source models such as Doubao-1.8. Reinforcement learning with verifiable rewards (RLVR) via GRPO, initialized from the SFT checkpoints, raises the performance ceiling further over zero-shot RL, which supports the value of verifiable synthetic data for long-horizon logical reasoning.

📝 Abstract

Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.
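The abstract describes RLVR via GRPO as driven by deterministic exact-match signals. The sketch below is not the authors' implementation; it only illustrates the general idea under stated assumptions: answers are plain numeric strings compared exactly as rationals, and advantages are computed by normalizing rewards within one sampled group. The function names (`parse_answer`, `exact_match_reward`, `group_relative_advantages`) and the sample answers are hypothetical.

```python
from fractions import Fraction
from statistics import mean, pstdev


def parse_answer(text):
    """Parse a numeric answer such as '3/4' or '0.75' into an exact Fraction, or None."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None


def exact_match_reward(predicted, ground_truth):
    """Return 1.0 when both answers parse to the same exact value, else 0.0."""
    pred, truth = parse_answer(predicted), parse_answer(ground_truth)
    return 1.0 if pred is not None and pred == truth else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (a GRPO-style baseline)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for one question whose ground truth is 3/4.
samples = ["3/4", "0.75", "0.7", "not sure"]
rewards = [exact_match_reward(s, "3/4") for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the reward is a deterministic comparison against a symbolic ground truth, no learned reward model is needed; responses that match the exact value receive positive advantages within the group, and the rest receive negative ones.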

Problem

Research questions and friction points this paper is trying to address.

geometric reasoning

visual hallucinations

Chain-of-Thought

symbolic ground truth

multimodal models

Innovation

Methods, ideas, or system contributions that make the work stand out.

neuro-symbolic reasoning

symbolically-verifiable synthesis

geometric reasoning