An Empirical Exploration of ChatGPT's Ability to Support Problem Formulation Tasks for Mission Engineering and a Documentation of its Performance Variability

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the quality and consistency of large language models (LLMs) in supporting problem formulation, particularly stakeholder identification, within mission engineering (ME). We conduct the first systematic evaluation of ChatGPT-3.5 on this task, employing multi-threaded parallel prompting, multi-step reasoning, and qualitative content analysis to construct a framework for assessing output variability. Results indicate that while the model is reasonably reliable at identifying human stakeholders, it exhibits significant limitations in recognizing system-level and environmental stakeholders, controlling abstraction levels, and decoupling interdependent concerns; moreover, its outputs are highly unstable, with poor cross-thread consistency. Our key contribution is the empirical documentation of inherent performance variability in LLMs during ME problem scoping, which confirms their current role as supplementary aids rather than substitutes for domain experts in critical problem-definition activities.
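
A minimal sketch of the multi-threaded parallel prompting setup described above, assuming the OpenAI Python client (pip install openai). The prompt wording, model name, and thread count are illustrative placeholders, not the authors' exact protocol:

```python
# Run several independent ChatGPT attempts in parallel so that
# differences across outputs reflect the model's own variability.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; the paper's actual prompts are not reproduced here.
PROMPT = (
    "You are assisting with mission engineering problem formulation. "
    "Identify the stakeholders of a NASA space mission design challenge, "
    "including human, system-level, and environmental stakeholders."
)

def run_attempt(_: int) -> str:
    """One independent attempt: a fresh conversation with no shared context."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
    )
    return response.choices[0].message.content

# Five parallel threads, each a separate conversation.
with ThreadPoolExecutor(max_workers=5) as pool:
    outputs = list(pool.map(run_attempt, range(5)))
```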

📝 Abstract
Systems engineering (SE) is evolving with the availability of generative artificial intelligence (AI) and the demand for a systems-of-systems perspective, formalized under the purview of mission engineering (ME) in the US Department of Defense. Formulating ME problems is challenging because they are open-ended exercises that involve translating ill-defined problems into well-defined ones that are amenable to engineering development. It remains to be seen to what extent AI could assist problem formulation objectives. To that end, this paper explores the quality and consistency of multi-purpose large language models (LLMs) in supporting ME problem formulation tasks, specifically focusing on stakeholder identification. We identify a relevant reference problem, a NASA space mission design challenge, and document ChatGPT-3.5's ability to perform stakeholder identification tasks. We execute multiple parallel attempts and qualitatively evaluate the LLM outputs, focusing on both their quality and variability. Our findings paint a nuanced picture. The LLM performs well in identifying human-focused stakeholders but poorly in recognizing external systems and environmental factors, despite explicit prompting to account for these. Additionally, it struggles to preserve the desired level of abstraction and tends to produce solution-specific outputs that are inappropriate for problem formulation. More importantly, we document substantial variability among parallel threads, highlighting that LLM outputs should be used with caution, ideally by adopting a stochastic view of their abilities. Overall, our findings suggest that, while ChatGPT could reduce some expert workload, its lack of consistency and domain understanding may limit its reliability for problem formulation tasks.
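
One way to make the cross-thread variability finding concrete is to score pairwise overlap between the stakeholder sets returned by parallel attempts. The paper's evaluation is qualitative content analysis; the Jaccard-based sketch below, with hypothetical stakeholder sets, is only an illustrative numeric proxy, not the authors' metric:

```python
# Pairwise Jaccard similarity between stakeholder sets from parallel threads;
# 1.0 means two threads returned identical sets, 0.0 means no overlap.
from itertools import combinations

# Hypothetical stakeholder sets extracted from three parallel attempts.
threads = [
    {"nasa", "astronauts", "mission control", "contractors"},
    {"nasa", "astronauts", "public", "launch provider"},
    {"nasa", "mission control", "regulators"},
]

def jaccard(a: set, b: set) -> float:
    """Intersection over union of two stakeholder sets."""
    return len(a & b) / len(a | b)

scores = [jaccard(a, b) for a, b in combinations(threads, 2)]
print(f"mean pairwise Jaccard: {sum(scores) / len(scores):.2f}")
```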
Problem

Research questions and friction points this paper is trying to address.

How well can ChatGPT support problem formulation in mission engineering (ME)?
How reliably does ChatGPT perform stakeholder identification on a reference ME problem?
How variable are ChatGPT's outputs across parallel attempts, and what limitations does that variability expose?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First systematic evaluation of ChatGPT-3.5 on ME stakeholder identification
Multi-threaded parallel prompting protocol for measuring output variability
Qualitative content-analysis framework for assessing LLM output quality and consistency