🤖 AI Summary
This work addresses the lack of benchmarks evaluating the deep mathematical and spatial reasoning of multimodal large language models (MLLMs) directly from images. We introduce MaRVL-QA, the first benchmark dedicated to mathematical surface plot understanding, comprising two fine-grained tasks: Topological Counting and Transformation Recognition. Images and precise annotations are generated from a procedural function library with rigorous ambiguity filtering, ensuring high fidelity and controllability. Experiments reveal that state-of-the-art MLLMs perform substantially below human-level accuracy on MaRVL-QA, exposing their reliance on superficial visual cues and fundamental deficits in genuine spatial and topological reasoning. MaRVL-QA thus fills a critical gap in mathematical visual reasoning evaluation and, through its reproducible, programmatically controlled data generation paradigm, provides a targeted diagnostic tool and reliable benchmark for advancing MLLM architectures and training methodologies.
📝 Abstract
A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features such as local maxima; and Transformation Recognition, recognizing applied geometric transformations. The benchmark is generated from a curated library of functions with rigorous ambiguity filtering. Our evaluation on MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning. MaRVL-QA provides a challenging new tool for the research community to measure progress, expose model limitations, and guide the development of MLLMs with more profound reasoning abilities.
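To make the Topological Counting task concrete, the sketch below shows one way ground-truth answers for such a benchmark could be derived programmatically: sample a two-variable function on a grid (as one would before rendering it as a surface plot) and count its strict local maxima. This is a minimal illustration under our own assumptions, not the authors' actual generation pipeline; the example function, grid resolution, and 8-neighbor strictness criterion are all hypothetical choices.

```python
import numpy as np

def count_local_maxima(z: np.ndarray) -> int:
    """Count interior grid points strictly greater than all 8 neighbors.

    Illustrative criterion only; a real pipeline would also need the
    ambiguity filtering described in the paper (e.g. rejecting plateaus
    or near-ties that make the count ill-defined).
    """
    c = z[1:-1, 1:-1]  # interior points
    neighbors = [
        z[:-2, :-2], z[:-2, 1:-1], z[:-2, 2:],
        z[1:-1, :-2],              z[1:-1, 2:],
        z[2:, :-2],  z[2:, 1:-1],  z[2:, 2:],
    ]
    is_max = np.ones_like(c, dtype=bool)
    for n in neighbors:
        is_max &= c > n  # strict inequality against every neighbor
    return int(is_max.sum())

# Example surface: the sum of two Gaussian bumps centered at (±1, 0).
# An odd point count (201) puts y = 0 exactly on the grid, avoiding
# symmetric value ties between adjacent rows.
x = np.linspace(-3.0, 3.0, 201)
y = np.linspace(-3.0, 3.0, 201)
X, Y = np.meshgrid(x, y)
Z = np.exp(-((X - 1) ** 2 + Y ** 2)) + np.exp(-((X + 1) ** 2 + Y ** 2))

print(count_local_maxima(Z))  # two well-separated peaks -> 2
```

A ground-truth count computed this way is exactly the kind of precise annotation that procedural generation makes cheap, whereas hand-labeling rendered 3D plots would be slow and error-prone.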