CapGeo: A Caption-Assisted Approach to Geometric Reasoning

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited performance in geometric reasoning, primarily due to insufficient visual understanding of geometric diagrams rather than inherent deficits in symbolic reasoning. Method: We propose CapGeo, a framework that bridges the gap between visual representations and symbolic reasoning via high-quality geometric diagram captioning and vision-language collaborative inference. Contribution/Results: We introduce CapGeo-Bench, the first captioning benchmark tailored for geometric reasoning, and propose a keypoint-alignment-based metric for caption quality assessment. Crucially, we empirically establish, for the first time, a strong correlation between caption quality and downstream reasoning accuracy. On geometric problem-solving, CapGeo boosts accuracy by 50.4 percentage points (from 8.6% to 59.0%) on Qwen2.5-VL-72B and by 28.2 points (from 44.8% to 73.0%) on Claude-Opus-4. This work provides both a quantifiable evaluation framework and an effective enhancement paradigm for geometric understanding in MLLMs.

📝 Abstract
Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
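The caption-assisted flow described in the abstract (convert the diagram to text, then reason purely in the text modality) can be sketched as a two-stage pipeline. The function names, the stub captioner, and the prompt wording below are illustrative assumptions, not the paper's actual implementation:

```python
from typing import Callable

def build_prompt(caption: str, question: str) -> str:
    """Combine a diagram caption with the question so a text-only
    reasoner can solve the problem without seeing the image.
    (Hypothetical prompt template, not the paper's exact wording.)"""
    return f"Diagram description: {caption}\nQuestion: {question}"

def solve_with_caption(
    image_id: str,
    question: str,
    captioner: Callable[[str], str],   # stands in for a captioning MLLM
    reasoner: Callable[[str], str],    # stands in for a text reasoner
) -> str:
    """Stage 1: caption the geometric diagram. Stage 2: reason over text."""
    caption = captioner(image_id)
    return reasoner(build_prompt(caption, question))

# Illustrative stubs; a real system would call model APIs here.
def stub_captioner(image_id: str) -> str:
    captions = {"fig1": "Triangle ABC with AB = 3, BC = 4, angle ABC = 90 degrees."}
    return captions[image_id]

if __name__ == "__main__":
    # Echo the prompt so the two-stage hand-off is visible.
    print(solve_with_caption("fig1", "What is AC?", stub_captioner, lambda p: p))
```

The key design point is that the reasoner never receives pixels: once the caption is faithful, the problem reduces to the textual reasoning that frontier models already do well.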
Problem

Research questions and friction points this paper is trying to address.

Improving geometric reasoning in multimodal language models
Bridging visual and textual modalities with captions
Evaluating geometric captioning quality through benchmark datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Caption-assisted framework bridges visual and textual modalities
Converts geometric diagrams into concise textual descriptions
Introduces benchmark with keypoint-based metric evaluation
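The keypoint-based metric can be illustrated as a recall over annotated ground-truth keypoints (e.g., side lengths, angles) checked against the candidate caption. This is a simplified sketch under assumed string-matching semantics; the paper's actual alignment procedure may differ:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for tolerant matching."""
    return re.sub(r"\s+", " ", text.lower().strip())

def keypoint_recall(caption: str, keypoints: list[str]) -> float:
    """Fraction of ground-truth keypoints that appear in the caption.
    A hypothetical stand-in for CapGeo-Bench's alignment metric."""
    if not keypoints:
        return 0.0
    cap = normalize(caption)
    hits = sum(1 for kp in keypoints if normalize(kp) in cap)
    return hits / len(keypoints)
```

For example, a caption stating two of three annotated facts would score 2/3, and the paper reports that such caption-quality scores correlate strongly with downstream reasoning accuracy.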