MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing optimization modeling benchmarks are almost entirely text-only, which makes them a poor fit for real-world decision-making scenarios where instance data often arrives in visual form (tables, graphs, maps, schedules, dashboards). This work proposes and constructs MM-OptBench, the first solver-grounded multimodal optimization modeling benchmark, spanning six optimization families, 26 subcategories, and three structural difficulty levels. The benchmark ensures correctness through structured instance generation and verification of every instance with an exact solver, and it builds the model-facing inputs and hidden reference files from the same verified source, enabling fine-grained evaluation and failure attribution. Evaluation of nine multimodal large language models on 780 verified instances shows that the best-performing model reaches 52.1% pass@1, that the six general-purpose models average only 15.9% pass@1 on hard instances, and that all three math-specialized models solve none of the 780 instances. These results point to significant limitations in current models' ability to extract instance data from text and visuals and turn it into solver-correct formulations and code.

📝 Abstract

Optimization modeling translates real decision-making problems into mathematical optimization models and solver-executable implementations. Although language models are increasingly used to generate optimization formulations and solver code, existing benchmarks are almost entirely text-only. This omits many optimization-modeling tasks that arise in operational practice, where requirements are described in text but instance information is conveyed through visual artifacts such as tables, graphs, maps, schedules, and dashboards. We introduce multimodal optimization modeling, a benchmark setting in which models must construct both a mathematical formulation and executable solver code from a text-and-visual problem specification. To evaluate this setting, we develop a solver-grounded framework that generates structured optimization instances, verifies each with an exact solver, and builds both the model-facing inputs and hidden reference files from the same verified source. We instantiate the framework as MM-OptBench, a benchmark of 780 solver-verified instances spanning 6 optimization families, 26 subcategories, and 3 structural difficulty levels. We evaluate 9 multimodal large language models (MLLMs), including 6 frontier general-purpose models and 3 math-specialized models, with aggregate, family-level, difficulty-level, and failure-mode analyses. The results show that the task remains far from solved: the best two models reach 52.1% and 51.3% pass@1, while on average across the six general-purpose MLLMs, pass@1 is 43.4% on easy instances and 15.9% on hard instances. All three math-specialized MLLMs solve 0/780 instances. Failure attribution shows that errors arise both when extracting instance data from text and visuals and when turning extracted data into solver-correct formulations and code. MM-OptBench provides a testbed for solver-grounded, decision-oriented multimodal intelligence.
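As a rough illustration of how solver-grounded pass@1 scoring could work, the sketch below marks an attempt as passing when the objective value produced by the model's generated code matches the solver-verified reference within a tolerance, then aggregates overall and per difficulty level. The abstract does not state the benchmark's actual pass criterion, so the tolerance-based check, the `Attempt` record layout, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Attempt:
    """One model attempt on one instance (hypothetical record layout)."""
    instance_id: str
    difficulty: str             # e.g. "easy" or "hard"
    objective: Optional[float]  # None if the generated code failed to run
    reference: float            # objective value from the exact solver


def is_pass(attempt: Attempt, rel_tol: float = 1e-6, abs_tol: float = 1e-6) -> bool:
    """Assumed criterion: objective matches the reference within tolerance."""
    if attempt.objective is None:
        return False
    return math.isclose(attempt.objective, attempt.reference,
                        rel_tol=rel_tol, abs_tol=abs_tol)


def pass_at_1(attempts: List[Attempt]) -> float:
    """Fraction of instances whose single attempt passes."""
    if not attempts:
        return 0.0
    return sum(is_pass(a) for a in attempts) / len(attempts)


def pass_at_1_by_difficulty(attempts: List[Attempt]) -> Dict[str, float]:
    """Group attempts by difficulty level and score each group separately."""
    groups: Dict[str, List[Attempt]] = {}
    for a in attempts:
        groups.setdefault(a.difficulty, []).append(a)
    return {level: pass_at_1(group) for level, group in groups.items()}


# Toy data, invented for this example only (not benchmark results).
toy = [
    Attempt("a", "easy", 12.0, 12.0),  # matches the reference
    Attempt("b", "easy", 7.5, 8.0),    # wrong objective value
    Attempt("c", "hard", None, 3.0),   # generated code did not run
    Attempt("d", "hard", 3.0, 3.0),    # matches the reference
]
print(pass_at_1(toy))                # 0.5
print(pass_at_1_by_difficulty(toy))  # {'easy': 0.5, 'hard': 0.5}
```

Comparing only objective values is the simplest possible check; a real harness would likely also need to handle infeasible or unbounded outcomes and to separate data-extraction errors from formulation errors, as the failure attribution described above suggests.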

Problem

Research questions and friction points this paper is trying to address.

multimodal optimization modeling

solver-grounded benchmark

optimization modeling

visual artifacts

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal optimization modeling

solver-grounded benchmark

MM-OptBench