AI Summary
Large vision-language models (LVLMs) struggle with accurate auxiliary-line generation in geometric reasoning; existing image-editing approaches suffer from insufficient precision, while text descriptions often misalign with underlying spatial structures. Method: We propose GeoVLMath, a framework that bypasses image editing entirely and instead generates structured textual descriptions that explicitly encode the auxiliary-line construction process, thereby leveraging LVLMs' semantic modeling capabilities more effectively. We introduce a cross-modal reward mechanism that quantifies spatial consistency between the generated text and ground-truth auxiliary-line diagrams, and perform fine-grained reinforcement learning using the GRPO framework on our newly curated dataset, AuxSolidMath. Contribution/Results: Evaluated on auxiliary-line reasoning benchmarks at the 3B- and 7B-parameter scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary models.
Abstract
Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. The reward's fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.
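As a rough illustration of how a scalar reward can drive a GRPO-style update, the sketch below computes group-relative advantages: each sampled description's reward is normalized against the other samples drawn for the same diagram. This is a minimal sketch of the generic GRPO normalization step, not the paper's implementation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for illustration; how the cross-modal reward itself is computed is not specified here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampling group (GRPO-style).

    `rewards` holds one scalar per sampled auxiliary-line description
    for the same problem. `eps` (an assumed constant) avoids division
    by zero when every sample receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical cross-modal reward scores for four sampled descriptions
# of a single diagram (illustrative values only).
rewards = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive positive advantages and are reinforced; those below the mean are penalized. When every sample in a group gets the same reward, the advantages are all zero and that group contributes no learning signal.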