Multilingual Multimodal Software Developer for Code Generation

📅 2025-07-11
🤖 AI Summary
Current large language models (LLMs) support code generation solely from textual inputs, limiting their ability to leverage widely adopted visual artifacts in software development—such as UML diagrams and flowcharts—thereby constraining architectural consistency and functional accuracy of generated code. To address this, we propose MM-Coder, the first multimodal, multilingual code generation framework designed for industrial-scale programming that synthesizes code aligned with both natural language instructions and visual workflows. We introduce MMc-Instruct, the first visually augmented multimodal instruction dataset, and design MMEval, a dedicated benchmark for evaluating multimodal code generation. MM-Coder employs cross-modal alignment and instruction-tuning strategies to significantly enhance comprehension and faithful execution of graphical semantics. Experiments demonstrate that MM-Coder substantially outperforms text-only baselines on MMEval, while also uncovering critical challenges in current multimodal code generation—including performance degradation on complex visual reasoning tasks and sensitivity to instruction perturbations.

📝 Abstract
The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.
Problem

Research questions and friction points this paper is trying to address.

Bridging text-only code generation with visual aids such as UML diagrams and flowcharts
Enhancing code accuracy and architectural alignment via combined textual and visual inputs
Evaluating multimodal code generation, for which no dedicated benchmark previously existed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates UML diagrams and flowcharts (Visual Workflow) with textual instructions
Trains on MMc-Instruct, a multimodal instruction-tuning dataset
Introduces MMEval, a benchmark for multimodal code generation
Linzheng Chai
Beihang University
Jian Yang
Beihang University
Shukai Liu
Beihang University
Wei Zhang
Beihang University
Liran Wang
Beihang University
Ke Jin
Beijing Institute of Technology
Tao Sun
Beihang University
Congnan Liu
Alibaba Group
Chenchen Zhang
M-A-P
Hualei Zhu
Beihang University
Jiaheng Liu
Nanjing University
Xianjie Wu
Beihang University
Ge Zhang
M-A-P
Tianyu Liu
M-A-P
Zhoujun Li
Beihang University