🤖 AI Summary
Existing multimodal large language models (MLLMs) lack comprehensive evaluation benchmarks for logical reasoning and code generation, particularly for cross-modal algorithmic understanding. Method: We introduce Code-Vision, a flowchart-driven, cross-modal code generation benchmark that requires models to parse algorithmic semantics from visual flowcharts and generate correct, executable code, establishing a "flowchart→code" cross-modal logical reasoning paradigm. Code-Vision comprises fine-grained evaluation subsets (HumanEval-V, Algorithm, and MATH) covering foundational programming, algorithms, and mathematics. Contribution/Results: Experiments on 12 MLLMs reveal a substantial performance gap between proprietary and open-source models on logical code generation: GPT-4o achieves 79.3% pass@1 on Hard problems, whereas the best open-source model scores only 15%, demonstrating Code-Vision's discriminative power for assessing logical comprehension. All data, code, and evaluation frameworks are publicly released.
📝 Abstract
This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision. Experimental results demonstrate a large performance gap between proprietary and open-source models: on Hard problems, GPT-4o achieves 79.3% pass@1, while the best open-source model achieves only 15%. Further experiments reveal that Code-Vision poses unique challenges compared to other multimodal reasoning benchmarks, MMCode and MathVista. We also explore the reasons for the poor performance of the open-source models. All data and code are available at https://github.com/wanghanbinpanda/CodeVision.
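The pass@1 numbers reported above follow the standard unbiased pass@k estimator used in functional-correctness evaluation. A minimal sketch (assuming the conventional formulation; the paper's exact evaluation harness may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples, drawn without replacement from n generations of which c pass
    all unit tests, is correct."""
    if n - c < k:
        # Fewer failing samples than k: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 generations per problem, 2 pass the tests -> pass@1 = 0.5
print(pass_at_k(4, 2, 1))
```

For k=1 this reduces to the fraction of passing generations, c/n, averaged over all problems in the benchmark.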