🤖 AI Summary
This study investigates the efficacy and risks of large language models (LLMs) in generating expert-level systems engineering (SE) artifacts. Given SE’s inherent dependence on interdisciplinary knowledge, domain depth, and operational context, the work employs zero-shot prompt engineering—without fine-tuning—to elicit SE documentation from multiple LLMs, benchmarking outputs against expert-generated artifacts as the gold standard. A mixed-method evaluation combines quantitative metrics (semantic similarity, embedding distance) with qualitative expert assessment. The study is the first to systematically identify and categorize three latent failure modes: premature requirement definition, unsubstantiated numerical estimation, and excessive specification. Results show that state-of-the-art LLMs, under carefully engineered prompts, produce text indistinguishable from expert output on standard NLP metrics; however, expert review reveals critical, high-risk conceptual flaws. The work cautions against uncritical adoption of general-purpose LLM outputs in SE practice and provides a methodological framework and risk taxonomy for deploying trustworthy AI in safety- and mission-critical engineering domains.
📝 Abstract
Multi-purpose Large Language Models (LLMs), a subset of generative Artificial Intelligence (AI), have recently made significant progress. While expectations that LLMs can assist systems engineering (SE) tasks are high, the interdisciplinary and complex nature of systems, along with the need to synthesize deep domain knowledge and operational context, raises questions about the efficacy of LLMs for generating SE artifacts, particularly given that they are trained on data broadly available on the internet. To that end, we present results from an empirical exploration in which a human expert-generated SE artifact was taken as a benchmark, parsed, and fed into various LLMs through prompt engineering to generate segments of typical SE artifacts. This procedure was applied without any fine-tuning or calibration in order to document baseline LLM performance. We then adopted a two-fold mixed-methods approach to compare the AI-generated artifacts against the benchmark. First, we compare the artifacts quantitatively using natural language processing algorithms and find that, when prompted carefully, state-of-the-art algorithms cannot differentiate AI-generated artifacts from the human-expert benchmark. Second, we conduct a qualitative deep dive to investigate how the artifacts differ in quality. We document that, although the two sets of material appear very similar, AI-generated artifacts exhibit serious failure modes that could be difficult to detect. We characterize these as premature requirements definition, unsubstantiated numerical estimates, and a propensity to overspecify. We contend that this study tells a cautionary tale about why the SE community must be more cautious in adopting AI-suggested feedback, at least when it is generated by multi-purpose LLMs.
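The quantitative comparison described in the abstract—embedding two texts and measuring their similarity—can be illustrated with a minimal sketch. The paper does not specify its exact models or metrics; the bag-of-words "embedding" and the example requirement sentences below are purely illustrative stand-ins for the real pipeline:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector.
    (A real pipeline would use learned sentence embeddings.)"""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical expert-written vs. AI-generated requirement fragments.
expert = "The system shall provide telemetry at a rate sufficient for ground monitoring."
generated = "The system shall provide telemetry at a rate adequate for ground monitoring."

score = cosine_similarity(embed(expert), embed(generated))
print(f"cosine similarity: {score:.3f}")
```

The point of the sketch mirrors the paper's finding: surface-level similarity scores can be very high even when the substituted wording ("sufficient" vs. "adequate") carries exactly the kind of conceptual difference that only qualitative expert review would catch.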