🤖 AI Summary
Current large language models (LLMs) support code generation solely from textual inputs, limiting their ability to leverage widely adopted visual artifacts in software development, such as UML diagrams and flowcharts, and thereby constraining the architectural consistency and functional accuracy of generated code. To address this, we propose MM-Coder, the first multimodal, multilingual code generation framework designed for industrial-scale programming that synthesizes code aligned with both natural language instructions and visual workflows. We introduce MMc-Instruct, the first visually augmented multimodal instruction dataset, and design MMEval, a dedicated benchmark for evaluating multimodal code generation. MM-Coder employs cross-modal alignment and instruction-tuning strategies to significantly enhance its comprehension and faithful execution of graphical semantics. Experiments demonstrate that MM-Coder substantially outperforms text-only baselines on MMEval, while also uncovering critical challenges in current multimodal code generation, including performance degradation on complex visual reasoning tasks and sensitivity to instruction perturbations.
📝 Abstract
The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset that includes visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, in contrast to prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation that addresses the limitations of existing text-only benchmarks. Our evaluations on MMEval highlight significant remaining challenges for models in precisely capturing visual information, following instructions, and applying advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.