🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks for mathematical reasoning focus predominantly on textual output, neglecting models’ ability to perform precise visual operations via executable code. Method: We introduce MathVizCode—the first systematic benchmark evaluating MLLMs’ code-based visual generation and editing capabilities in mathematical contexts. It covers geometric diagrams, function plots, and three types of statistical charts, and defines two core dimensions, code generation and code editing, the latter further subdivided into deletion, modification, and annotation operations. Executable code serves as the intermediate representation, enabling fully automated, reproducible, and quantitative evaluation. Contribution/Results: Experiments across nine state-of-the-art MLLMs reveal a substantial gap between current models and human performance on code-based visual operations, exposing critical bottlenecks in multimodal code understanding and grounded visual reasoning.
📝 Abstract
Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving MLLMs' ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLMs' code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG), which evaluates the model's ability to accurately understand and construct visualizations from scratch; and (2) Multi-modal Code Editing (MCE), which assesses the model's capacity for fine-grained operations of three types: Deletion, Modification, and Annotation. To evaluate these tasks, we curate a dataset covering the five most popular types of mathematical figures (geometric diagrams, function plots, and three types of statistical charts) to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
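To make the two evaluation aspects concrete, the following is a minimal, hypothetical sketch (not taken from the benchmark itself) of what code-as-intermediate-representation looks like for a function plot: a generation step builds the figure from scratch (MCG-style), and an editing step applies a fine-grained Annotation operation to the existing figure (MCE-style). The function names and the specific figure are illustrative assumptions; matplotlib is used only as a plausible rendering backend.

```python
# Illustrative sketch only: function names and figure content are hypothetical,
# not drawn from the benchmark described above.
import matplotlib
matplotlib.use("Agg")  # headless backend so the code runs without a display
import matplotlib.pyplot as plt
import numpy as np


def generate_plot():
    """MCG-style step: construct a function plot entirely from code."""
    fig, ax = plt.subplots()
    x = np.linspace(-2, 2, 100)
    ax.plot(x, x**2, label="f(x) = x^2")
    ax.legend()
    return fig, ax


def annotate_minimum(ax):
    """MCE-style 'Annotation' edit: mark the minimum of the parabola.

    Deletion and Modification edits would analogously remove or alter
    existing artists on the axes instead of adding a new one.
    """
    ax.annotate("minimum", xy=(0, 0), xytext=(0.5, 1.0),
                arrowprops=dict(arrowstyle="->"))
    return ax


fig, ax = generate_plot()
annotate_minimum(ax)
```

Because the figure lives as executable code rather than pixels, an evaluator can re-run the script and compare the resulting artists (lines, text, legends) programmatically, which is what makes fully automated, reproducible scoring possible.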