MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark

📅 2024-08-14
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
Existing mathematical reasoning benchmarks rely heavily on synthetic images and fail to capture the complexity of real-world multimodal reasoning over photographs and textual mathematics. Method: We introduce MathScape—the first hierarchical multimodal benchmark for photo-based mathematical problems—featuring a novel “scene–semantics–task” three-level taxonomy that systematically pairs authentic images with formal mathematical semantics, addressing a longstanding gap in joint vision-language mathematical reasoning evaluation. Contribution/Results: We evaluate 11 state-of-the-art multimodal large language models (MLLMs) in a dual-track setup that assesses both theoretical understanding and practical application. Even the top-performing models achieve below 50% average accuracy, exposing critical weaknesses in cross-modal alignment, symbolic parsing, and multi-step reasoning. MathScape thus establishes a challenging, fine-grained, and interpretable evaluation paradigm for multimodal mathematical reasoning.

📝 Abstract
With the development of Multimodal Large Language Models (MLLMs), the evaluation of multimodal models in the context of mathematical problems has become a valuable research field. Multimodal visual-textual mathematical reasoning serves as a critical indicator for evaluating the comprehension and complex multi-step quantitative reasoning abilities of MLLMs. However, previous multimodal math benchmarks have not sufficiently integrated visual and textual information. To address this gap, we propose MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information. MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs through a categorical hierarchical approach. We conduct a multi-dimensional evaluation of 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models. By analyzing the evaluation results, we identify the limitations of MLLMs, offering valuable insights for enhancing model performance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' real-world math reasoning abilities
Assessing multimodal math proficiency beyond synthetic content
Bridging the gap between digital and real-world math challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world images paired with math problems
Multi-dimensional evaluation of MLLMs
Benchmark for real-world math reasoning
👥 Authors

Minxuan Zhou
Illinois Institute of Technology

Hao Liang
Peking University

Tianpeng Li
Baichuan Inc.

Zhiyu Wu
DeepSeek-AI, Peking University
MLLM, Emotion Recognition, Semi-Supervised Learning

Mingan Lin
Baichuan Inc.
LLM, MLLM, AI

Linzhuang Sun
University of Chinese Academy of Sciences
Multimodal Reasoning

Yaqi Zhou
Baichuan Inc.

Yan Zhang
Baichuan Inc.

Xiaoqin Huang
Baichuan Inc.

Yicong Chen
Baichuan Inc.

Yujin Qiao
Baichuan Inc.

Weipeng Chen
Baichuan Inc.

Bin Cui
Peking University

Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, HTSC, time-resolved

Zenan Zhou
Baichuan Inc.