MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks for mathematical reasoning focus predominantly on textual output, neglecting models' ability to perform precise visual operations via executable code. Method: We introduce MathOPEval, the first systematic benchmark evaluating MLLMs' code-based visual generation and editing capabilities in mathematical contexts. It covers five figure types (geometric diagrams, function plots, and three kinds of statistical charts) and defines two core evaluation dimensions, code generation and code editing, with editing further subdivided into deletion, modification, and annotation operations. Executable code serves as the intermediate representation, enabling fully automated, reproducible, and quantitative evaluation. Contribution/Results: Experiments across nine state-of-the-art MLLMs reveal a substantial performance gap between current models and human-level accuracy in fine-grained visual operations, exposing critical bottlenecks in multimodal code understanding and grounded visual reasoning.

📝 Abstract
Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM's ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLMs' code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model's ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model's capacity for fine-grained operations, which include three types: Deletion, Modification, and Annotation. To evaluate these tasks, we incorporate a dataset covering the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
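To make the two task types concrete, here is a minimal, purely illustrative sketch of how MCG (build a figure from scratch) and MCE (apply a deletion, modification, or annotation to an existing figure) could be structured. This is not the paper's actual data format or harness; a plain dict stands in for the executable plotting code, and all function and field names are hypothetical. Only the three editing sub-type names come from the paper.

```python
# Hypothetical sketch of the benchmark's two task types.
# A dict stands in for executable figure code; only the operation
# names (delete / modify / annotate) follow the paper's taxonomy.

def generate_figure_spec(kind, elements):
    """MCG: construct a figure specification from scratch."""
    return {"kind": kind, "elements": list(elements), "annotations": []}

def edit_figure_spec(spec, op, **kwargs):
    """MCE: apply one fine-grained operation, returning a new spec."""
    new = {"kind": spec["kind"],
           "elements": list(spec["elements"]),
           "annotations": list(spec["annotations"])}
    if op == "delete":
        new["elements"].remove(kwargs["element"])
    elif op == "modify":
        i = new["elements"].index(kwargs["old"])
        new["elements"][i] = kwargs["new"]
    elif op == "annotate":
        new["annotations"].append(kwargs["text"])
    else:
        raise ValueError(f"unknown operation: {op}")
    return new

# MCG: build a function plot with two curves.
spec = generate_figure_spec("function_plot", ["y=x^2", "y=2x"])
# MCE: modify one curve, then annotate the figure.
spec = edit_figure_spec(spec, "modify", old="y=x^2", new="y=3x^2")
spec = edit_figure_spec(spec, "annotate", text="vertex at origin")
```

Because the "figure" is a structured object produced by code, a harness can compare the model's output against a reference spec exactly, which is what makes the evaluation automated and reproducible.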
Problem

Research questions and friction points this paper is trying to address.

Evaluates MLLMs' code-based visual operations in math reasoning
Assesses multi-modal code generation and editing capabilities
Measures fine-grained visual operation accuracy across math figures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Code as intermediate for visual operations
Evaluates multi-modal code generation and editing
Benchmark with diverse mathematical figures