🤖 AI Summary
Current large language models (LLMs) support code generation solely from textual inputs, limiting their ability to leverage widely adopted visual artifacts in software development, such as UML diagrams and flowcharts, and thereby constraining the architectural consistency and functional accuracy of generated code. To address this, we propose MM-Coder, the first multimodal, multilingual code generation framework designed for industrial-scale programming that synthesizes code aligned with both natural language instructions and visual workflows. We introduce MMc-Instruct, the first visually augmented multimodal instruction dataset, and design MMEval, a dedicated benchmark for evaluating multimodal code generation. MM-Coder employs cross-modal alignment and instruction-tuning strategies to significantly enhance its comprehension and faithful execution of graphical semantics. Experiments demonstrate that MM-Coder substantially outperforms text-only baselines on MMEval, while also uncovering critical challenges in current multimodal code generation, including performance degradation on complex visual reasoning tasks and sensitivity to instruction perturbations.
📝 Abstract
The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset that includes visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, in contrast to prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation that addresses the limitations of existing text-only benchmarks. Our evaluations on MMEval highlight significant remaining challenges for models in precisely capturing visual information, following instructions, and applying advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.