MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) are typically evaluated on manually cleaned mathematical problem images, leaving a gap in benchmarks representative of real-world K-12 educational scenarios, such as low-resolution, skewed, or cluttered smartphone-captured math images. Method: We introduce MathReal, a benchmark comprising 2,000 authentic, user-captured K-12 math problem images, organized by a taxonomy of real-image challenges (image quality degradation, perspective variation, and irrelevant content interference) spanning three categories and fourteen subcategories, and evaluated under six fine-grained experimental settings. Contribution/Results: Experiments reveal substantial performance degradation of current MLLMs on real-world images, exposing weaknesses in their recognition, comprehension, and reasoning capabilities. MathReal establishes an education-grounded benchmark and provides concrete directions for advancing practical multimodal mathematical reasoning research.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is presented as an image that contains both the question text and its visual elements. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, covering three question types at three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on these findings, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvement. Data and code: https://github.com/junfeng0288/MathReal.
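To make the evaluation setup concrete, below is a minimal Python sketch of the kind of accuracy loop a benchmark like MathReal implies: load per-question metadata, query a model on each question image, and aggregate accuracy by taxonomy label. The metadata file layout, the field names (image, answer, category, difficulty), and the query_mllm helper are all assumptions for illustration only; the actual data format and evaluation code live in the official repository (https://github.com/junfeng0288/MathReal).

```python
# Hypothetical sketch of an evaluation loop over MathReal-style data.
# The schema and the query_mllm helper are assumed, not taken from the repo.
import json
from collections import defaultdict

def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to the MLLM under evaluation."""
    raise NotImplementedError("wire up your model client here")

def evaluate(metadata_path: str) -> dict:
    with open(metadata_path, encoding="utf-8") as f:
        questions = json.load(f)  # assumed: a list of question records

    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        # Assumed fields: image path, ground-truth answer, and the
        # taxonomy labels the paper describes (3 categories /
        # 14 subcategories, question type, difficulty level).
        pred = query_mllm(q["image"], "Solve the math problem in this image.")
        key = (q["category"], q["difficulty"])
        total[key] += 1
        correct[key] += int(pred.strip() == q["answer"].strip())

    # Per-(category, difficulty) accuracy, mirroring the paper's
    # fine-grained breakdown across real-image challenge types.
    return {k: correct[k] / total[k] for k in total}
```

Aggregating by (category, difficulty) rather than reporting a single overall score follows the paper's emphasis on fine-grained analysis across the real-image challenge taxonomy.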
Problem

Research questions and friction points this paper is trying to address.

Evaluates MLLMs' math reasoning with real-world K-12 images
Addresses gaps in clean vs. authentic multimodal benchmarks
Analyzes MLLMs' performance under image degradation and interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world K-12 math dataset MathReal
Three image categories, 14 subcategories
Six experimental settings for evaluation
Jun Feng
Baidu Inc., Beijing, China
Zixin Wang
Nanyang Technological University, Singapore
Zhentao Zhang
Xiaopeng Motors, China
Yue Guo
Gaoling School of Artificial Intelligence, Renmin University of China
Zhihan Zhou
Beihang University, Beijing, China
Xiuyi Chen
Baidu (previously CASIA)
RAG, Multimodal, Dialogue
Zhenyang Li
Baidu Inc., Beijing, China
Dawei Yin
Senior Director, Head of Search Science at Baidu
Machine Learning, Web Mining, Data Mining