🤖 AI Summary
To address the lack of standardized evaluation benchmarks for flowchart-to-code generation, this paper introduces Flow2Code, the first cross-lingual, multimodal benchmark comprising 15 programming languages, 5,622 code snippets, and 16,866 flowcharts (including code-, UML-, and pseudocode-derived variants). We propose a standardized evaluation framework that jointly models flowchart image encoding and code sequence generation, augmented with cross-lingual program semantic alignment. Comprehensive evaluation across 13 state-of-the-art multimodal large language models reveals widespread deficiencies in flowchart logical reasoning. Supervised fine-tuning (SFT) substantially improves performance, yielding up to a 37.2% average accuracy gain. Key contributions include: (1) the first dedicated benchmark for flowchart-to-code generation; (2) a reproducible, cross-lingual evaluation framework; and (3) a high-quality, open-source dataset and evaluation toolkit.
📝 Abstract
While large language models (LLMs) show promise in code generation, existing benchmarks neglect flowchart-based code generation. To promote further research in this direction, this work presents Flow2Code, a novel benchmark for evaluating flowchart-based code generation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs reveal that current models cannot reliably generate code from flowcharts. Moreover, experimental results show that supervised fine-tuning substantially improves model performance. We publicly release our code and datasets at https://github.com/hml-github/Flow2Code.
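The dataset figures quoted above are internally consistent: each code segment is paired with one flowchart of each of the three types. A minimal sketch checking that arithmetic (the constant names are our own, not part of the released toolkit):

```python
# Sanity check of the reported Flow2Code dataset statistics:
# each of the 5,622 code segments has three flowchart variants
# (code-, UML-, and pseudocode-derived), giving 16,866 flowcharts.
CODE_SEGMENTS = 5622
FLOWCHART_TYPES = ("code", "UML", "pseudocode")

total_flowcharts = CODE_SEGMENTS * len(FLOWCHART_TYPES)
print(total_flowcharts)  # 16866
```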