MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal mathematical benchmarks predominantly focus on single-image scenarios, failing to assess models’ mathematical reasoning capabilities in realistic multi-visual contexts. Method: We introduce MV-MATH, the first K–12 mathematical reasoning benchmark for image–text interleaved, multi-image settings, comprising 2,009 problems across 11 subjects, three difficulty levels, and diverse question types. Leveraging human-curated, education-driven data construction, we systematically define and evaluate multimodal large language models’ (MLLMs’) mathematical reasoning under multi-visual conditions, incorporating multi-granularity annotations, cross-modal alignment design, and fine-grained evaluation protocols. Contribution/Results: Experiments reveal that state-of-the-art MLLMs achieve less than 45% average accuracy on MV-MATH—substantially below human performance—uncovering critical bottlenecks and characteristic error patterns. MV-MATH thus establishes a rigorous, diagnostic benchmark to advance research in multimodal mathematical reasoning.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs' mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs' mathematical reasoning capabilities within multi-visual settings.
Problem

Research questions and friction points this paper is trying to address.

Evaluates multimodal math reasoning in multi-visual contexts.
Introduces MV-MATH dataset for real-world K-12 scenarios.
Analyzes MLLMs' challenges and performance gaps in math tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

MV-MATH dataset for multi-visual math reasoning
Integrates multiple images with text annotations
Assesses MLLMs in multi-visual mathematical contexts
Peijie Wang
Institute of Automation, Chinese Academy of Sciences
Multimodal LLMs · Math Reasoning
Zhongzhi Li
Institute of Automation, Chinese Academy of Sciences
LLM · NLP · Math Reasoning
Fei Yin
MAIS, Institute of Automation of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Dekang Ran
MAIS, Institute of Automation of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Chenglin Liu
MAIS, Institute of Automation of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences