🤖 AI Summary
To address the lack of standardized evaluation benchmarks for flowchart-to-code generation, this paper introduces Flow2Code, the first cross-lingual, multimodal benchmark comprising 15 programming languages, 5,622 code snippets, and 16,866 flowcharts (including code-, UML-, and pseudocode-derived variants). We propose a standardized evaluation framework that jointly models flowchart image encoding and code sequence generation, augmented with cross-lingual program semantic alignment. Comprehensive evaluation across 13 state-of-the-art multimodal large language models reveals widespread deficiencies in flowchart logical reasoning. Supervised fine-tuning (SFT) substantially improves performance, yielding up to a 37.2% average accuracy gain. Key contributions include: (1) the first dedicated benchmark for flowchart-to-code generation; (2) a reproducible, cross-lingual evaluation framework; and (3) a high-quality, open-source dataset and evaluation toolkit.
📝 Abstract
While large language models (LLMs) show promise in code generation, existing benchmarks neglect flowchart-based code generation. To promote further research in this direction, this work presents Flow2Code, a novel benchmark for evaluating flowchart-based code generation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs reveal that current models cannot reliably generate code from flowcharts. Moreover, experimental results show that supervised fine-tuning substantially improves model performance. We publicly release our code and datasets at https://github.com/hml-github/Flow2Code.
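The dataset figures quoted above are internally consistent: each code segment is paired with one flowchart of each of the three types. A minimal sketch checking that arithmetic (the constant names are our own, not part of the released toolkit):

```python
# Sanity check of the reported Flow2Code dataset statistics:
# each of the 5,622 code segments has three flowchart variants
# (code-, UML-, and pseudocode-derived), giving 16,866 flowcharts.
CODE_SEGMENTS = 5622
FLOWCHART_TYPES = ("code", "UML", "pseudocode")

total_flowcharts = CODE_SEGMENTS * len(FLOWCHART_TYPES)
print(total_flowcharts)  # 16866
```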