🤖 AI Summary
Existing multimodal large language models (MLLMs) lack comprehensive evaluation benchmarks for logical reasoning and code generation, particularly for cross-modal algorithmic understanding. Method: We introduce Code-Vision, a flowchart-driven, cross-modal code generation benchmark that requires models to parse algorithmic semantics from visual flowcharts and generate correct, executable code, establishing a "flowchart→code" cross-modal logical reasoning paradigm. Code-Vision comprises fine-grained evaluation subsets (HumanEval-V, Algorithm, and MATH) covering foundational programming, algorithms, and mathematics. Contribution/Results: Experiments on 12 MLLMs reveal a substantial performance gap between proprietary and open-source models on logical code generation: GPT-4o achieves 79.3% pass@1 on Hard problems, whereas the best open-source model scores only 15%, demonstrating Code-Vision's discriminative power for assessing logical comprehension. All data, code, and evaluation frameworks are publicly released.
📝 Abstract
This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision. Experimental results demonstrate a large performance gap between proprietary and open-source models: on Hard problems, GPT-4o achieves 79.3% pass@1, while the best open-source model achieves only 15%. Further experiments reveal that Code-Vision poses unique challenges compared to other multimodal reasoning benchmarks, MMCode and MathVista. We also explore the reasons for the poor performance of the open-source models. All data and code are available at https://github.com/wanghanbinpanda/CodeVision.
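The pass@1 numbers reported above follow the standard unbiased pass@k estimator used in functional-correctness evaluation. A minimal sketch (assuming the conventional formulation; the paper's exact evaluation harness may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples, drawn without replacement from n generations of which c pass
    all unit tests, is correct."""
    if n - c < k:
        # Fewer failing samples than k: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 generations per problem, 2 pass the tests -> pass@1 = 0.5
print(pass_at_k(4, 2, 1))
```

For k=1 this reduces to the fraction of passing generations, c/n, averaged over all problems in the benchmark.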