Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability

πŸ“… 2025-06-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the lack of standardized evaluation benchmarks for flowchart-to-code generation, this paper introduces Flow2Codeβ€”the first cross-lingual, multimodal benchmark comprising 15 programming languages, 5,622 code snippets, and 16,866 flowcharts (including code-, UML-, and pseudocode-derived variants). We propose a standardized evaluation framework that jointly models flowchart image encoding and code sequence generation, augmented with cross-lingual program semantic alignment. Comprehensive evaluation across 13 state-of-the-art multimodal large language models reveals widespread deficiencies in flowchart logical reasoning. Supervised fine-tuning (SFT) substantially improves performance, yielding up to a 37.2% average accuracy gain. Key contributions include: (1) the first dedicated benchmark for flowchart-to-code generation; (2) a reproducible, cross-lingual evaluation framework; and (3) a high-quality, open-source dataset and evaluation toolkit.
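The summary mentions a standardized evaluation framework but does not spell out the scoring procedure. As a rough, hypothetical illustration (not the authors' actual toolkit or metric), functional correctness for a flowchart-to-code sample can be checked by executing the generated snippet against reference test cases, and benchmark accuracy is then the fraction of samples that pass:

```python
# Hypothetical sketch of a flowchart-to-code accuracy metric (assumed,
# not the paper's actual evaluation code): a generated snippet is judged
# correct if it passes all reference test cases for its task.

def passes_tests(generated_src: str, entry_point: str, test_cases) -> bool:
    """Execute generated code and check it against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(generated_src, namespace)  # run the generated snippet
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # any crash or wrong definition counts as failure

def accuracy(samples) -> float:
    """Fraction of samples whose generated code passes all its tests."""
    if not samples:
        return 0.0
    passed = sum(
        passes_tests(s["code"], s["entry_point"], s["tests"]) for s in samples
    )
    return passed / len(samples)

# Example: one correct and one buggy generation for a
# "return the larger of two numbers" flowchart.
samples = [
    {"code": "def max2(a, b):\n    return a if a > b else b",
     "entry_point": "max2", "tests": [((1, 2), 2), ((5, 3), 5)]},
    {"code": "def max2(a, b):\n    return a",  # ignores the comparison branch
     "entry_point": "max2", "tests": [((1, 2), 2), ((5, 3), 5)]},
]
print(accuracy(samples))  # 0.5
```

Execution-based scoring of this kind is common for code-generation benchmarks; the paper's framework additionally handles 15 languages and cross-lingual semantic alignment, which this sketch does not attempt.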

πŸ“ Abstract
While large language models (LLMs) show promise in code generation, existing benchmarks neglect flowchart-based code generation. To promote further research in this direction, this work presents Flow2Code, a novel benchmark for evaluating flowchart-based code generation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs reveal that current LLMs cannot reliably generate code from flowcharts. Moreover, the experimental results show that supervised fine-tuning contributes greatly to model performance. We publicly release our code and datasets at https://github.com/hml-github/Flow2Code.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for flowchart-based code generation
Creating a benchmark (Flow2Code) for flowchart-to-code tasks
Assessing 13 multimodal LLMs across 15 programming languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow2Code benchmark for flowchart-based code generation
Evaluates 13 multimodal LLMs across 15 languages
Supervised fine-tuning boosts model performance significantly
Mengliang He
East China Normal University, Shanghai, China
Jiayi Zeng
East China University
Yankai Jiang
Shanghai AI Lab, Shanghai, China
Wei Zhang
East China Normal University, Shanghai, China
Zeming Liu
Beihang University, Beijing, China
Xiaoming Shi
East China Normal University, Shanghai, China
Aimin Zhou
East China Normal University, Shanghai, China