🤖 AI Summary
This work addresses the limited ability of multimodal large language models to accurately perceive fine-grained visual elements and understand three-dimensional spatial relationships in geometric reasoning. To bridge this gap, the authors introduce a unified formal language that encompasses both planar and solid geometry, along with GDP-29K—a large-scale dataset comprising 29,000 real-world diagram–description pairs. By combining supervised fine-tuning with reinforcement learning guided by verifiable rewards, the proposed approach enables models to precisely parse geometric diagrams into formal descriptions. The method achieves state-of-the-art performance on diagram parsing tasks and substantially enhances downstream geometric reasoning capabilities.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to a perception bottleneck in recognizing fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry, which additionally requires spatial understanding, remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that the parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs' capabilities on downstream geometry reasoning tasks. Our data and code are available at Geoparsing.
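To make the "verifiable rewards" idea concrete, here is a minimal sketch of what a reward checker for formal geometry descriptions could look like. The clause grammar (`Point(A)`, `Segment(A,B)`, etc.) and the all-or-nothing scoring are hypothetical illustrations, not the paper's actual formal language or reward function: the reward is nonzero only if every clause is syntactically well formed and geometrically consistent in the sense that relations reference only declared points.

```python
import re

# Hypothetical clause syntax: Head(P1,P2,...) with single-letter points.
# The paper's actual unified formal language is richer than this.
CLAUSE = re.compile(r"^([A-Za-z]+)\(([A-Z](?:,[A-Z])*)\)$")

def verifiable_reward(clauses):
    """Return 1.0 iff every clause parses and every relation
    mentions only points previously declared via Point(...)."""
    declared = set()
    parsed = []
    for clause in clauses:
        m = CLAUSE.match(clause.replace(" ", ""))
        if m is None:
            return 0.0  # syntax error -> zero reward
        head, args = m.group(1), m.group(2).split(",")
        parsed.append((head, args))
        if head == "Point":
            declared.update(args)
    # Consistency check: relations may only use declared points.
    for head, args in parsed:
        if head != "Point" and any(a not in declared for a in args):
            return 0.0
    return 1.0

good = ["Point(A)", "Point(B)", "Segment(A,B)"]
bad = ["Segment(A,B)"]  # references undeclared points
```

In an RLVR loop, such a deterministic checker would score each sampled formal description, letting the policy be optimized against objective correctness signals rather than learned reward models.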