🤖 AI Summary
This work addresses the scarcity of large-scale, high-fidelity, labeled cybersecurity datasets, which hampers the development of effective intrusion detection systems (IDS). The authors propose using large language models (LLMs) as controlled knowledge-to-data engines that synthesize structured network traffic without real traffic traces or dedicated testbeds. The LLMs are conditioned on protocol specifications, attack semantics, and explicit statistical rules, with no fine-tuning and no access to the original data, which enables privacy-preserving, on-demand generation of synthetic datasets. A multi-level validation framework covering distributional similarity, structural comparison, and cross-domain classification is used to assess the fidelity and utility of the synthetic data. On the AWID3 benchmark, gradient-boosting classifiers trained solely on the synthetic data achieve F1 scores of up to 0.956 when evaluated on real traffic, which suggests the approach is practically viable.
📝 Abstract
Realistic, large-scale, and well-labeled cybersecurity datasets are essential for training and evaluating Intrusion Detection Systems (IDS). However, they remain difficult to obtain due to privacy constraints, data sensitivity, and the cost of building controlled collection environments such as testbeds and cyber ranges. This paper investigates whether Large Language Models (LLMs) can operate as controlled knowledge-to-data engines for generating structured synthetic network traffic datasets suitable for IDS research. We propose a methodology that combines protocol documentation, attack semantics, and explicit statistical rules to condition LLMs without fine-tuning or access to raw samples. Using the AWID3 IEEE 802.11 benchmark as a demanding case study, we generate labeled datasets with four state-of-the-art LLMs and assess fidelity through a multi-level validation framework including global similarity metrics, per-feature distribution testing, structural comparison, and cross-domain classification. Results show that, under explicit constraints, LLM-generated datasets can closely approximate the statistical and structural characteristics of real network traffic, enabling gradient-boosting classifiers to achieve F1-scores up to 0.956 when evaluated on real samples. Overall, the findings suggest that constrained LLM-driven generation can facilitate on-demand IDS experimentation, providing a testbed-free, privacy-preserving alternative that overcomes the traditional bottlenecks of physical traffic collection and manual labeling.
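To make the per-feature distribution testing step more concrete, the following is a minimal sketch of one way such a check could look. The abstract does not state which statistical test the authors use; the two-sample Kolmogorov-Smirnov statistic here is an assumption, and the feature names (`frame_len`, `signal_dbm`) and the toy Gaussian data are hypothetical placeholders, not AWID3 values.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of a and b:
    0.0 means identical empirical distributions, 1.0 means no overlap.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    # The gap can only change at observed values, so checking those suffices.
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(
            bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted)
        )
        for x in points
    )


def per_feature_ks(real, synthetic):
    """Compare each shared numeric feature of two datasets.

    real / synthetic: dict mapping feature name -> list of numeric values.
    """
    return {
        name: ks_statistic(real[name], synthetic[name])
        for name in real
        if name in synthetic
    }


# Toy data (hypothetical): one feature the generator matches well,
# one where the synthetic distribution is clearly shifted.
rng = random.Random(0)
real = {
    "frame_len": [rng.gauss(200, 30) for _ in range(500)],
    "signal_dbm": [rng.gauss(-60, 5) for _ in range(500)],
}
synthetic = {
    "frame_len": [rng.gauss(200, 30) for _ in range(500)],
    "signal_dbm": [rng.gauss(-40, 5) for _ in range(500)],
}

report = per_feature_ks(real, synthetic)
for name, stat in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: KS = {stat:.3f}")
```

A report of this kind would flag features where the synthetic distribution drifts from the real one (here, the shifted `signal_dbm`), which is the sort of signal a per-feature check is meant to surface before moving on to structural comparison or cross-domain classification.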