MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) are typically evaluated on manually cleaned mathematical problem images, leaving a gap in benchmarks representative of real-world K-12 educational scenarios, such as low-resolution, skewed, or cluttered smartphone-captured math images. Method: We introduce MathReal, a benchmark comprising 2,000 authentic, user-captured K-12 math problem images, organized by a taxonomy of real-image challenges (image quality degradation, perspective variation, and irrelevant content interference) spanning three categories and fourteen subcategories, and evaluated under six fine-grained experimental settings. Contribution/Results: Experiments reveal substantial performance degradation of current MLLMs on real-world images, exposing weaknesses in their recognition, comprehension, and reasoning capabilities. MathReal establishes an education-grounded benchmark and provides concrete directions for advancing practical multimodal mathematical reasoning research.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is presented as an image that contains both the question text and its visual elements. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, covering three question types at three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on these findings, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvement. Data and code: https://github.com/junfeng0288/MathReal.
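To make the evaluation setup concrete, below is a minimal Python sketch of the kind of accuracy loop a benchmark like MathReal implies: load per-question metadata, query a model on each question image, and aggregate accuracy by taxonomy label. The metadata file layout, the field names (image, answer, category, difficulty), and the query_mllm helper are all assumptions for illustration only; the actual data format and evaluation code live in the official repository (https://github.com/junfeng0288/MathReal).

```python
# Hypothetical sketch of an evaluation loop over MathReal-style data.
# The schema and the query_mllm helper are assumed, not taken from the repo.
import json
from collections import defaultdict

def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to the MLLM under evaluation."""
    raise NotImplementedError("wire up your model client here")

def evaluate(metadata_path: str) -> dict:
    with open(metadata_path, encoding="utf-8") as f:
        questions = json.load(f)  # assumed: a list of question records

    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        # Assumed fields: image path, ground-truth answer, and the
        # taxonomy labels the paper describes (3 categories /
        # 14 subcategories, question type, difficulty level).
        pred = query_mllm(q["image"], "Solve the math problem in this image.")
        key = (q["category"], q["difficulty"])
        total[key] += 1
        correct[key] += int(pred.strip() == q["answer"].strip())

    # Per-(category, difficulty) accuracy, mirroring the paper's
    # fine-grained breakdown across real-image challenge types.
    return {k: correct[k] / total[k] for k in total}
```

Aggregating by (category, difficulty) rather than reporting a single overall score follows the paper's emphasis on fine-grained analysis across the real-image challenge taxonomy.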
Problem

Research questions and friction points this paper is trying to address.

Evaluates MLLMs' math reasoning with real-world K-12 images
Addresses gaps in clean vs. authentic multimodal benchmarks
Analyzes MLLMs' performance under image degradation and interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world K-12 math dataset MathReal
Three image categories, 14 subcategories
Six experimental settings for evaluation
Jun Feng
Baidu Inc., Beijing, China
Zixin Wang
Nanyang Technological University, Singapore
Zhentao Zhang
Xiaopeng Motors, China
Yue Guo
Gaoling School of Artificial Intelligence, Renmin University of China
Zhihan Zhou
Beihang University, Beijing, China
Xiuyi Chen
Baidu (previously CASIA)
RAG, Multimodal, Dialogue
Zhenyang Li
Baidu Inc., Beijing, China
Dawei Yin
Senior Director, Head of Search Science at Baidu
Machine Learning, Web Mining, Data Mining