AI Summary
This study addresses the challenge of comprehending the complex, decentralized architecture of ROS 2 systems, which poses significant difficulties for developers. It presents the first systematic evaluation of large language models' (LLMs') ability to understand real-world ROS 2 system architectures. Using three ROS 2 systems of varying scales, the authors employ a novel algorithm to automatically generate 1,230 architecture-related questions. A ground truth is established by running and monitoring the systems in a controlled experiment, enabling assessment of nine LLMs via answer accuracy, explanation coherence scoring, and perplexity analysis. Results show that the LLMs achieve a mean accuracy of 98.22% (with Gemini-2.5-Pro reaching 100%) and explanation coherence scores ranging from 0.394 to 0.762, demonstrating their strong potential in aiding architectural understanding. The work also provides a reproducible framework for question generation and evaluation.
Abstract
Context. The most widely used development framework for robotics software is ROS2. ROS2 architectures are highly complex, with thousands of components communicating in a decentralized fashion. Goal. We aim to evaluate how LLMs can assist in the comprehension of factual information about the architecture of ROS2 systems. Method. We conduct a controlled experiment in which we administer 1,230 prompts to 9 LLMs; the prompts contain architecturally-relevant questions about 3 ROS2 systems of increasing size. We provide a generic algorithm that systematically generates architecturally-relevant questions for a ROS2 system. Then, we (i) assess the accuracy of the LLMs' answers against a ground truth established by running and monitoring the 3 ROS2 systems and (ii) qualitatively analyse the explanations provided by the LLMs. Results. Almost all questions are answered correctly across all LLMs (mean=98.22%). gemini-2.5-pro performs best (100% accuracy across all prompts and systems), followed by o3 (99.77%) and gemini-2.5-flash (99.72%); the lowest-performing LLM is gpt-4.1 (95%). Only 300/1,230 prompts are incorrectly answered, of which 249 are about the most complex system. The coherence scores of the LLMs' explanations range from 0.394 for "service references" to 0.762 for "communication path". The mean perplexity varies significantly across models, with chatgpt-4o achieving the lowest score (19.6) and o4-mini the highest (103.6). Conclusions. There is great potential in using LLMs to help ROS2 developers comprehend non-trivial aspects of the software architecture of their systems. Nevertheless, developers should be aware of the intrinsic limitations and differing performance of the LLMs and take these into account when using them.
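To make the two quantitative measures in the abstract concrete, the following is a minimal sketch of how answer accuracy against a ground truth and perplexity could be computed. It is not the authors' pipeline: the function names, the question identifiers, the exact-match comparison, and the sample data are all assumptions made for illustration. Perplexity follows the standard definition, the exponential of the negative mean log-probability per token; the abstract does not state how it was obtained in the study.

```python
import math
from typing import Mapping, Sequence


def accuracy(answers: Mapping[str, str], ground_truth: Mapping[str, str]) -> float:
    """Fraction of ground-truth questions answered correctly.

    Assumption: an answer is correct if it matches the expected string
    after trimming whitespace and ignoring case. Missing answers count as wrong.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1
        for question_id, expected in ground_truth.items()
        if answers.get(question_id, "").strip().lower() == expected.strip().lower()
    )
    return correct / len(ground_truth)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp(-mean of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical ground truth and answers for four questions about a ROS2 system.
ground_truth = {"q1": "yes", "q2": "no", "q3": "/scan", "q4": "yes"}
model_answers = {"q1": "Yes", "q2": "no", "q3": "/odom", "q4": "yes"}

print(accuracy(model_answers, ground_truth))  # 0.75 (3 of 4 match)
print(perplexity([math.log(0.5)] * 4))
```

In the second call every token has probability 0.5, so the perplexity is mathematically 2: lower values mean the model found the text less surprising. Exact-match scoring is the simplest possible choice here; free-text answers would likely need normalisation or manual review.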