🤖 AI Summary
Design rationale (DR) documentation is often missing or incomplete in software architecture decision-making, which hinders traceability and maintainability. Method: This work presents the first systematic evaluation of large language models (LLMs) for automated DR generation, using 100 real-world architecture problems collected from Stack Overflow and GitHub. We compare three prompting paradigms, namely zero-shot prompting, chain-of-thought (CoT) prompting, and LLM-based agents, across five state-of-the-art LLMs, with human expert annotations as ground truth and quantitative assessment via precision, recall, and F1-score. Contribution/Results: The best-performing configuration achieves an F1-score of 0.389. Of the generated rationale arguments not covered by the expert ground truth, 64.45% to 69.42% are still judged helpful, while only 1.59% to 3.24% are potentially misleading. The study demonstrates the feasibility of LLM-assisted DR documentation and identifies the applicability boundaries and optimization opportunities of the different prompting strategies for architecture knowledge generation.
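As a rough illustration of the evaluation setup, the sketch below shows how argument-level precision, recall, and F1 could be computed once LLM-generated arguments have been matched against expert-annotated ones. This is a minimal sketch, not the paper's code; in particular, how a generated argument is judged equivalent to an expert argument (by similarity scoring or by manual annotation) is an assumption here, since the summary does not specify it.

```python
# Illustrative sketch (not the paper's code): argument-level precision/recall/F1
# against expert-provided ground truth. The matching step that produces the
# (generated_idx, expert_idx) pairs is assumed to happen elsewhere.

def evaluate_dr(generated_args, expert_args, matches):
    """
    generated_args: list of DR arguments produced by an LLM
    expert_args:    list of DR arguments in the human ground truth
    matches:        set of (generated_idx, expert_idx) pairs judged equivalent
    """
    matched_generated = {g for g, _ in matches}
    matched_expert = {e for _, e in matches}

    precision = len(matched_generated) / len(generated_args) if generated_args else 0.0
    recall = len(matched_expert) / len(expert_args) if expert_args else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: 3 of 9 generated arguments match 3 of 4 expert arguments.
p, r, f1 = evaluate_dr(list(range(9)), list(range(4)), {(0, 0), (2, 1), (5, 3)})
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")  # precision=0.333 recall=0.750 f1=0.462
```

The asymmetry in the toy output mirrors the shape of the reported results, where recall is substantially higher than precision: the models tend to generate more arguments than the expert ground truth contains, covering most of it along with extra material.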
📝 Abstract
Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, and it provides valuable insights into the different phases of the architecting process throughout software development. However, in practice, DR is often inadequately documented due to a lack of motivation and effort from developers. With the recent advancements in Large Language Models (LLMs), their capabilities in text comprehension, reasoning, and generation may enable the generation and recovery of DR for architecture decisions. In this study, we evaluated the performance of LLMs in generating DR for architecture decisions. First, we collected 50 Stack Overflow (SO) posts, 25 GitHub issues, and 25 GitHub discussions related to architecture decisions to construct a dataset of 100 architecture-related problems. Then, we selected five LLMs to generate DR for the architecture decisions with three prompting strategies: zero-shot, chain of thought (CoT), and LLM-based agents. With the DR provided by human experts as ground truth, the precision of LLM-generated DR across the three prompting strategies ranges from 0.267 to 0.278, recall from 0.627 to 0.715, and F1-score from 0.351 to 0.389. Additionally, 64.45% to 69.42% of the DR arguments not mentioned by human experts are also helpful, 4.12% to 4.87% of the arguments have uncertain correctness, and 1.59% to 3.24% of the arguments are potentially misleading. Based on these results, we further discuss the pros and cons of the three prompting strategies and the strengths and limitations of the LLM-generated DR.
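For readers unfamiliar with the three prompting strategies, the sketch below shows one way each could be instantiated for a single architecture problem. None of this reflects the paper's actual prompts or agent design; the prompt wording, the `query_llm` callable, and the tool loop are all illustrative assumptions.

```python
# Hypothetical sketch of the three prompting strategies for DR generation.
# query_llm(prompt) stands in for any chat-completion API call; tools is a
# dict of name -> callable (e.g., retrieving the original SO/GitHub thread).

def zero_shot_prompt(problem, decision):
    # Zero-shot: ask directly for the design rationale, no reasoning scaffold.
    return (f"Architecture problem:\n{problem}\n\n"
            f"Decision made:\n{decision}\n\n"
            "List the design rationale (arguments) behind this decision.")

def cot_prompt(problem, decision):
    # Chain of thought: request intermediate reasoning before the final rationale.
    return (f"Architecture problem:\n{problem}\n\n"
            f"Decision made:\n{decision}\n\n"
            "Think step by step: identify the quality attributes at stake, the "
            "alternatives considered, and the trade-offs, then summarize the "
            "design rationale as a list of arguments.")

def agent_generate_dr(problem, decision, query_llm, tools, max_steps=5):
    # Agent-style: the model may iteratively request tools before committing
    # to a final rationale. The TOOL:/FINAL: protocol here is made up for
    # illustration only.
    context = ""
    for _ in range(max_steps):
        step = query_llm(f"{problem}\n{decision}\n{context}\n"
                         "Either request a tool (format: TOOL:<name>:<query>) "
                         "or output FINAL:<rationale>.")
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        _, name, query = step.split(":", 2)
        context += f"\n[{name}] {tools[name](query)}"
    return context  # fall back to whatever evidence was gathered
```

The intended contrast is that zero-shot relies entirely on the model's prior knowledge, CoT adds an explicit reasoning scaffold, and the agent variant can gather additional context before answering, which is where the trade-offs between the strategies discussed in the paper would show up.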