Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding

📅 2026-04-05

🤖 AI Summary

This work addresses two gaps: existing vision-language models (VLMs) have limited ability to comprehend software architecture diagrams, which are structured engineering artifacts, and no dedicated evaluation benchmark exists for this task. To bridge these gaps, we introduce SADU, the first VLM benchmark tailored to the software design phase, comprising 154 architecture diagrams of behavioral, structural, and entity-relationship (ER) types along with 2,431 multimodal question-answering tasks. We systematically evaluate 11 VLMs from the Gemini, GPT, Claude, and Qwen families on counting and retrieval-based reasoning tasks. The top-performing model, gemini-3-flash-preview, achieves only 70.18% accuracy, revealing significant limitations in current VLMs' ability to parse diagrammatic structures and ground visual relationships. The benchmark provides a foundation for advancing VLM research in software engineering contexts.

📝 Abstract

Software architecture diagrams are important design artifacts for communicating system structure, behavior, and data organization throughout the software development lifecycle. Although recent progress in large language models has substantially advanced code-centric software engineering tasks such as code generation, testing, and maintenance, the ability of modern vision-language models (VLMs) to understand software architecture diagrams remains underexplored. To address this gap, we present SADU, a benchmark for Software Architecture Diagram Understanding that evaluates VLMs on architecture diagrams as structured software engineering artifacts rather than generic images. SADU contains 154 carefully curated diagrams spanning behavioral, structural, and entity-relationship (ER) types, paired with structured annotations and 2,431 question-answer tasks covering counting and retrieval reasoning. We evaluate 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families. Our results show that software architecture diagram understanding remains challenging for current models: the best-performing model, gemini-3-flash-preview, achieves only 70.18% accuracy, while gpt-4o-mini achieves only 17.77%. The results further reveal weaknesses in diagram reasoning and visual relation grounding, highlighting a gap between current VLMs and the needs of design-stage software engineering. SADU provides a foundation for future research on diagram-aware AI systems and more faithful AI-assisted software engineering workflows.
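
To make the reported accuracy figures concrete, here is a minimal sketch of how accuracy over question-answer tasks might be computed and broken down by diagram type or task type. The abstract does not specify the scoring procedure, so everything here is an assumption: the `QARecord` fields, the `normalize` helper, and the use of exact-match scoring are hypothetical, and the four toy records are invented purely for illustration and are not SADU data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QARecord:
    """One question-answer task. Field names are hypothetical, not SADU's schema."""
    diagram_type: str  # e.g. "behavioral", "structural", "er"
    task_type: str     # e.g. "counting", "retrieval"
    gold: str          # reference answer
    predicted: str     # model answer


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not scored as an error."""
    return " ".join(answer.lower().split())


def is_correct(record: QARecord) -> bool:
    """Exact match after normalization (an assumed scoring rule)."""
    return normalize(record.gold) == normalize(record.predicted)


def overall_accuracy(records) -> float:
    """Percent of records answered correctly; 0.0 for an empty list."""
    if not records:
        return 0.0
    return 100.0 * sum(is_correct(r) for r in records) / len(records)


def accuracy_by(records, key: str) -> dict:
    """Percent correct per group, where `key` names a QARecord field."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        group = getattr(r, key)
        totals[group] += 1
        hits[group] += is_correct(r)
    return {g: 100.0 * hits[g] / totals[g] for g in totals}


# Toy records, invented for illustration only.
records = [
    QARecord("er", "counting", "4", "4"),
    QARecord("er", "retrieval", "Customer", " customer "),
    QARecord("behavioral", "counting", "7", "6"),
    QARecord("structural", "retrieval", "OrderService", "PaymentService"),
]

print(overall_accuracy(records))
print(accuracy_by(records, "diagram_type"))
print(accuracy_by(records, "task_type"))
```

If the reported accuracy is micro-averaged over all 2,431 tasks (an assumption the abstract does not confirm), 70.18% would correspond to roughly 1,706 correct answers and 17.77% to roughly 432.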

Problem

Research questions and friction points this paper is trying to address.

Software Architecture Diagram Understanding

Vision-Language Models

Diagram Reasoning

Visual Relation Grounding

AI-assisted Software Engineering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Software Architecture Diagram Understanding

Vision-Language Models

Benchmarking