GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

📅 2024-12-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing multimodal large language models (MLLMs) lack rigorous evaluation of fundamental geometric perception capabilities, such as shape understanding, spatial relation reasoning, and abstraction of visual patterns. Method: We introduce GePBench, the first benchmark dedicated to low-level geometric perception. It formally defines and quantifies geometric perception in MLLMs and proposes a structured, generalizable evaluation framework. Leveraging synthetically generated geometric figures and logical constraints, GePBench produces diverse test samples covering multi-granularity tasks, including collinearity judgment, symmetry detection, and topological reasoning. Results: Extensive experiments reveal significant deficiencies in state-of-the-art MLLMs on these geometric tasks. Fine-tuning models on GePBench yields an average accuracy improvement of 4.2% on downstream visual reasoning and chart comprehension tasks, demonstrating the benchmark’s validity and cross-task generalization value.
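
The summary mentions synthetically generated figures with logically constrained labels but does not describe the generator itself. As a rough illustration of the idea only, the following Python sketch builds a yes/no collinearity item whose answer is derived from the geometry rather than annotated by hand. The function names, the integer grid, and the question wording are all hypothetical and are not taken from GePBench; rendering the points to an image is omitted.

```python
import random


def is_collinear(p, q, r):
    """Exact collinearity test for integer points via the 2D cross product."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]) == 0


def make_collinearity_sample(rng, grid=10):
    """Build one hypothetical item: three distinct grid points plus a derived label."""
    want_collinear = rng.random() < 0.5
    while True:
        p = (rng.randint(0, grid), rng.randint(0, grid))
        q = (rng.randint(0, grid), rng.randint(0, grid))
        if want_collinear:
            # Put r on the line through p and q: r = p + k * (q - p).
            k = rng.choice([-1, 2])
            r = (p[0] + k * (q[0] - p[0]), p[1] + k * (q[1] - p[1]))
        else:
            r = (rng.randint(0, grid), rng.randint(0, grid))
        points = (p, q, r)
        # Logical constraints: distinct points, inside the grid, label as requested.
        if len(set(points)) < 3:
            continue
        if not all(0 <= c <= grid for pt in points for c in pt):
            continue
        if is_collinear(*points) != want_collinear:
            continue
        return {
            "points": points,
            "question": "Are points A, B, and C collinear?",
            "answer": "yes" if want_collinear else "no",
        }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the samples are reproducible
    for _ in range(3):
        print(make_collinearity_sample(rng))
```

Rejection sampling keeps the sketch short: any candidate that violates a constraint is discarded and redrawn, so every returned item has a label that is correct by construction.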

📝 Abstract
Multimodal large language models (MLLMs) have achieved significant advancements in integrating visual and linguistic understanding. While existing benchmarks evaluate these models in context-rich, real-life scenarios, they often overlook fundamental perceptual skills essential for environments deviating from everyday realism. In particular, geometric perception, the ability to interpret spatial relationships and abstract visual patterns, remains underexplored. To address this limitation, we introduce GePBench, a novel benchmark designed to assess the geometric perception capabilities of MLLMs. Results from extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in such tasks. Additionally, we demonstrate that models trained with data sourced from GePBench show notable improvements on a wide range of downstream tasks, underscoring the importance of geometric perception as a foundation for advanced multimodal applications. Our code and datasets will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Geometric Understanding
Shape and Spatial Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

GePBench
Multi-modal Language Models
Geometric Understanding
Shangyu Xing
Master Student, Nanjing University
Multimodality, Large Language Models
Changhao Xiang
National Key Laboratory for Novel Software Technology, Nanjing University, China
Yuteng Han
National Key Laboratory for Novel Software Technology, Nanjing University, China
Yifan Yue
National Key Laboratory for Novel Software Technology, Nanjing University, China
Zhen Wu
National Key Laboratory for Novel Software Technology, Nanjing University, China
Xinyu Liu
National Key Laboratory for Novel Software Technology, Nanjing University, China
Zhangtai Wu
National Key Laboratory for Novel Software Technology, Nanjing University, China
Fei Zhao
National Key Laboratory for Novel Software Technology, Nanjing University, China
Xinyu Dai
Nanjing University