🤖 AI Summary
This study addresses the lack of empirical evaluation of large language models (LLMs) for STRIDE threat modeling, particularly structured threat classification in 5G network security. Method: We systematically assess five state-of-the-art LLMs—including GPT-4, Claude, and Llama—using a prompt engineering framework that integrates few-shot learning, chain-of-thought reasoning, role-based prompting, and strict output formatting constraints. Contribution/Results: Results reveal substantial performance disparities across STRIDE categories, with accuracy ranging from 41% to 89% (e.g., for Spoofing and Tampering), exposing inherent model biases and domain knowledge gaps. Performance is shown to depend jointly on the semantic complexity of the threat and the distribution of the training data. We propose a cybersecurity-oriented LLM optimization framework, empirically validating that domain-adapted prompting and lightweight fine-tuning significantly improve classification robustness. This work establishes a methodological foundation and a practical pathway for deploying LLMs in automated, scalable threat modeling.
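To make the framework concrete, the following is a minimal, illustrative sketch (not the paper's actual prompts; category lists, example threats, and the output-line convention are assumptions) of how the four techniques named above can be combined into a single STRIDE classification prompt, together with a parser for the constrained output:

```python
# Illustrative sketch of the four prompting techniques named in the summary:
# role-based prompting, few-shot learning, chain-of-thought reasoning, and
# strict output formatting, applied to STRIDE classification of a 5G threat.
# The examples and the "CATEGORY:" convention are hypothetical.

STRIDE = ["Spoofing", "Tampering", "Repudiation",
          "Information Disclosure", "Denial of Service",
          "Elevation of Privilege"]

# Hypothetical few-shot examples (threat description, gold label).
FEW_SHOT = [
    ("An attacker impersonates a legitimate gNB to lure UEs.", "Spoofing"),
    ("Signaling messages are modified in transit on the N2 interface.",
     "Tampering"),
]

def build_prompt(threat: str) -> str:
    lines = [
        # Role-based prompting: fix the model's persona.
        "You are a 5G network security analyst performing STRIDE "
        "threat modeling.",
        "Classify the threat into exactly one category: "
        + ", ".join(STRIDE) + ".",
        # Chain-of-thought: ask for reasoning before the verdict.
        "Think step by step about the violated security property, "
        "then answer.",
        # Strict output formatting: constrain the final line for parsing.
        "End your reply with a line of the form 'CATEGORY: <name>'.",
        "",
    ]
    for example, label in FEW_SHOT:  # few-shot learning
        lines += [f"Threat: {example}", f"CATEGORY: {label}", ""]
    lines.append(f"Threat: {threat}")
    return "\n".join(lines)

def parse_category(reply: str) -> str:
    """Extract the label from the constrained final line of a reply."""
    for line in reversed(reply.strip().splitlines()):
        if line.startswith("CATEGORY:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no CATEGORY line found")
```

The strict output line is what makes large-scale evaluation tractable: the parser needs only the final `CATEGORY:` line, so free-form chain-of-thought text before it does not break scoring.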
📝 Abstract
Artificial Intelligence (AI) is expected to be an integral part of next-generation AI-native 6G networks. With the growing prevalence of AI, researchers have identified numerous use cases for AI in network security. However, studies analyzing the suitability of Large Language Models (LLMs) for network security are almost nonexistent. To fill this gap, we examine the suitability of LLMs for network security through a case study of STRIDE threat modeling. We apply four prompting techniques to five LLMs to perform STRIDE classification of 5G threats. From our evaluation results, we highlight key findings and detailed insights, along with explanations of the factors that likely influence LLM behavior when modeling certain threats. The numerical results and insights underscore the necessity of adapting and fine-tuning LLMs for network security use cases.