🤖 AI Summary
This study addresses the limited multimodal understanding available to generative software engineering (SE). It presents the first systematic investigation of GPT-4's multimodal interface for integrating visual modeling languages (specifically, UML class and sequence diagrams) with natural language prompts. The authors propose a novel prompting paradigm in which visual modeling artifacts serve as a first-class modality, and design a task-driven evaluation framework to empirically assess its effectiveness across three core SE tasks: requirements understanding, code generation, and architectural reasoning. Experimental results show that multimodal (image-text) prompting significantly outperforms text-only baselines, yielding an average accuracy improvement of 37%. This confirms the value of visual modalities in generative SE and fills a critical research gap concerning the systematic application of multimodal large language models in software engineering.
📝 Abstract
Multimodal GPTs represent a watershed in the interplay between Software Engineering and Generative Artificial Intelligence. GPT-4 accepts image and text inputs, rather than natural language alone. We investigate relevant use cases stemming from these enhanced capabilities of GPT-4. To the best of our knowledge, no other work has investigated similar use cases involving Software Engineering tasks carried out via multimodal GPTs prompted with a mix of diagrams and natural language.
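The paper does not publish its prompting code, so the following is only an illustrative sketch of the paradigm it describes: packaging a UML diagram image together with a natural language instruction into a single multimodal prompt. It assumes the OpenAI Chat Completions message format for image inputs; the function name `build_multimodal_prompt` and all parameter names are hypothetical, not taken from the paper.

```python
import base64


def build_multimodal_prompt(diagram_png: bytes, instruction: str) -> list:
    """Combine a UML diagram (PNG bytes) and a textual instruction into
    one user message, mixing image and text content parts in the
    OpenAI Chat Completions format (one plausible realization of the
    image+text prompting the paper evaluates)."""
    encoded = base64.b64encode(diagram_png).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                # Textual part: the SE task the model should perform.
                {"type": "text", "text": instruction},
                # Visual part: the diagram, inlined as a base64 data URL.
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encoded}"},
                },
            ],
        }
    ]


if __name__ == "__main__":
    # Placeholder bytes stand in for a real class-diagram image.
    messages = build_multimodal_prompt(
        b"<png bytes>",
        "Generate Java classes implementing this UML class diagram.",
    )
    print(messages[0]["role"], [p["type"] for p in messages[0]["content"]])
```

The resulting `messages` list could then be passed to a multimodal chat endpoint; a text-only baseline, by contrast, would have to replace the image part with a verbal description of the diagram.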