GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation

📅 2025-10-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) struggle to generate auxiliary lines accurately in geometric reasoning; existing image-editing approaches lack geometric precision, while text descriptions often misalign with the underlying spatial structures. Method: We propose GeoVLMath, a framework that bypasses image editing entirely and instead generates structured textual descriptions that explicitly encode the auxiliary-line construction process, thereby leveraging LVLMs' semantic modeling capabilities more effectively. We introduce a cross-modal reward that quantifies the spatial consistency between the generated text and ground-truth auxiliary lines, and perform fine-grained reinforcement learning with the GRPO framework on our newly curated dataset, AuxSolidMath. Contribution/Results: Across multiple auxiliary-line reasoning benchmarks, GeoVLMath significantly outperforms strong open-source and proprietary models at the 3B and 7B parameter scales, with substantial gains in both geometric reasoning accuracy and interpretability.

๐Ÿ“ Abstract
Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. This fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing geometry reasoning in vision-language models
Improving auxiliary line creation through cross-modal alignment
Bridging textual descriptions with spatial geometric structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates textual descriptions for auxiliary-line constructions
Uses cross-modal reward for diagram-text alignment
Implements reinforcement learning with GRPO-based training
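The reward-and-advantage step behind these contributions can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `cross_modal_reward` is a hypothetical stand-in for the paper's spatial-consistency scorer (which compares a generated auxiliary-line description against the ground-truth auxiliary-line diagram), and only the group-relative advantage normalization at the core of GRPO is shown; the clipped policy-gradient loss and KL term are omitted.

```python
import statistics

def cross_modal_reward(spatial_consistency: float, format_ok: bool) -> float:
    # Hypothetical reward: a spatial-consistency score in [0, 1] from a
    # frozen cross-modal scorer, plus a small bonus for well-formed
    # construction text. The paper's actual reward design may differ.
    return spatial_consistency + (0.1 if format_ok else 0.0)

def grpo_advantages(rewards: list[float]) -> list[float]:
    # GRPO samples a group of responses per problem and, instead of a
    # learned value critic, normalizes each response's reward by the
    # group mean and standard deviation.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]
```

In use, the highest-consistency description in a sampled group receives a positive advantage and the weakest a negative one, so the policy update pushes the LVLM toward descriptions that better match the ground-truth auxiliary-line diagram.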