🤖 AI Summary
This work addresses the lack of dedicated datasets and systematic evaluation methodologies for automatically generating software architectures from requirements documents. It introduces R2ABench, a benchmark that comprises a novel dataset of real-world requirements documents paired with expert-annotated PlantUML architecture diagrams, along with a hybrid evaluation framework integrating structural diagram metrics, multidimensional human assessments, and architectural anti-pattern detection. Experimental results demonstrate that while large language models can generate syntactically valid architectures containing key entities, they exhibit significant limitations in relational reasoning. Code-specialized models show modest improvements, whereas agent-based workflows do not consistently enhance performance. This study establishes a standardized benchmark and a multifaceted evaluation suite for the requirements-to-architecture generation task.
📝 Abstract
Recently, Large Language Models (LLMs) have demonstrated significant potential in automating software engineering tasks. Generating software architecture designs from requirements documents is a crucial step in software development. However, there is currently a lack of dedicated datasets tailored for this task. To bridge this gap, we introduce R2ABench (Requirement-To-Architecture Benchmark), a novel benchmark comprising diverse real-world software projects paired with comprehensive Product Requirements Documents (PRDs) and expert-curated PlantUML reference diagrams. Furthermore, we propose a multi-dimensional, hybrid evaluation framework that assesses generated diagrams across three complementary layers: Structural Graph Metrics, Multi-dimensional Scoring, and Architecture Anti-pattern Detection. Using this framework, we conducted a comprehensive empirical study evaluating state-of-the-art models and agentic workflows. Our study shows that LLMs achieve strong syntactic validity and robust entity extraction but fundamentally struggle with relational reasoning, leading to structurally fragmented architectures. Code-specialized models partially alleviate this limitation, while agent frameworks introduce significant instability rather than consistent improvements. R2ABench provides a robust and standardized foundation for advancing LLM-driven software architecture generation.
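To make the "Structural Graph Metrics" layer concrete, one can imagine comparing a generated diagram to its reference by treating each as a set of entities (components) and directed relations (edges), then scoring set overlap. The sketch below is a hypothetical illustration under that assumption; the function and data names are invented for the example and are not R2ABench's actual implementation.

```python
# Hypothetical structural comparison of two architecture diagrams:
# each diagram is reduced to a set of entities and a set of directed
# relations, and overlap is scored with F1. This is an illustrative
# sketch, not the benchmark's real metric.

def f1(predicted: set, reference: set) -> float:
    """F1 score of the overlap between predicted and reference elements."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)           # true positives
    precision = tp / len(predicted)
    recall = tp / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy entities/relations as if parsed from two PlantUML component diagrams.
ref_entities = {"API Gateway", "Auth Service", "User DB"}
gen_entities = {"API Gateway", "Auth Service", "Order Service"}

ref_relations = {("API Gateway", "Auth Service"), ("Auth Service", "User DB")}
gen_relations = {("API Gateway", "Auth Service")}

print(round(f1(gen_entities, ref_entities), 3))    # entity-level F1
print(round(f1(gen_relations, ref_relations), 3))  # relation-level F1
```

In this toy case entity F1 is high while relation F1 is lower, mirroring the paper's finding that models extract key entities well but struggle with relational reasoning.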