🤖 AI Summary
This work addresses the susceptibility of large vision-language models to visual hallucinations in geometric reasoning and the scarcity of mathematically precise chain-of-thought data. It proposes GeoSym, a scalable neuro-symbolic synthesis framework that combines a type-conditional grammar, an analytic symbolic solver (SymGT), and a rendering pipeline to automatically generate GeoSym127K, a verifiable multimodal geometry dataset with 127K questions. It also introduces GeoSym-Bench, an expert-curated evaluation benchmark of 511 samples. After supervised fine-tuning on this synthetic data, Qwen3-VL-8B gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19%) on WeMath, outperforming closed-source models such as Doubao-1.8. Reinforcement learning with verifiable rewards (RLVR) via GRPO reaches a higher performance ceiling when initialized from the SFT checkpoints than when applied without them, supporting the value of verifiable synthetic data for long-horizon logical reasoning.
📝 Abstract
Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. Using a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and feeds them into a rendering pipeline that produces high-precision geometric diagrams. With this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we show that GeoSym concentrates its improvements on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (a +6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially raises the performance ceiling over zero-shot RL. Because the rewards are deterministic exact-match signals, this result points to the scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.
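To make the RLVR step more concrete, the following is a minimal sketch of how a deterministic exact-match reward can feed a GRPO-style group-relative advantage. It is not the authors' implementation: the function names (`normalize_answer`, `exact_match_reward`, `group_relative_advantages`), the string normalization rules, and the use of the population standard deviation are all assumptions made for illustration.

```python
from fractions import Fraction
from statistics import mean, pstdev


def normalize_answer(text: str) -> str:
    """Canonicalize an answer string (hypothetical rule set).

    Trims, lowercases, and removes spaces; plain numbers and fractions
    are reduced to one exact form so that "0.5" and "1/2" compare equal.
    """
    cleaned = text.strip().lower().replace(" ", "")
    try:
        return str(Fraction(cleaned))
    except (ValueError, ZeroDivisionError):
        # Not a plain number (e.g. "sqrt(2)"): compare the cleaned string.
        return cleaned


def exact_match_reward(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    same = normalize_answer(prediction) == normalize_answer(ground_truth)
    return 1.0 if same else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Each reward is centered on the group mean and divided by the group
    standard deviation (population std here, an assumption); eps avoids
    division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    ground_truth = "1/2"
    sampled_answers = ["0.5", "2", " 1/2 ", "sqrt(2)"]
    rewards = [exact_match_reward(a, ground_truth) for a in sampled_answers]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the reward depends only on a deterministic comparison against the symbolic ground truth, it can be recomputed and audited for any sampled response, which is the property the abstract relies on when it describes the signal as verifiable.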