GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning

📅 2025-04-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks inadequately evaluate multimodal large language models' (MLLMs) ability to identify and adaptively apply geometric principles in geometry problem solving (GPS), leaving a critical assessment gap. Method: We introduce GeoSense, the first comprehensive bilingual geometric reasoning benchmark, covering plane and solid geometry through a five-level hierarchical taxonomy of geometric principles. We propose a principle-driven, hierarchical evaluation strategy that jointly assesses *principle identification* and *adaptive application*, and we release a meticulously annotated bilingual dataset of 1,789 problems. Contribution/Results: Empirical evaluation across 12 state-of-the-art MLLMs reveals that Gemini-2.0-pro-flash achieves the highest overall score (65.3), while principle identification and adaptive application emerge as the primary bottlenecks. This work establishes a new standard and foundational resource for systematic, fine-grained assessment of MLLMs' geometric reasoning capabilities.

📝 Abstract
Geometry problem solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of this human-like geometric reasoning mechanism in MLLMs, leaving a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of 65.3. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' geometric reasoning via principle identification
Assessing human-like multimodal reasoning in geometry problem-solving
Bridging the gap in geometric principle application benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive bilingual benchmark for MLLMs
Five-level hierarchical geometric principles framework
Innovative evaluation strategy with annotated dataset
Liangyu Xu
Alibaba Group
Yingxiu Zhao
Alibaba Group
Jingyun Wang
Beihang University
Yingyao Wang
Alibaba Group, Harbin Institute of Technology
Bu Pi
Alibaba Group
Chen Wang
Alibaba Group
Mingliang Zhang
Alibaba Group
Jihao Gu
University College London
Xiang Li
Alibaba Group
Xiaoyong Zhu
Jiangsu University
Jun Song
Shenzhen University
Bo Zheng
Alibaba Group