Knowledge-to-Data: LLM-Driven Synthesis of Structured Network Traffic for Testbed-Free IDS Evaluation

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of large-scale, high-fidelity, labeled cybersecurity datasets, which hampers the development of effective intrusion detection systems (IDS). The authors use large language models (LLMs) as controlled knowledge-to-data engines to synthesize structured network traffic without real traffic traces or dedicated testbeds. The LLMs are conditioned on protocol specifications, attack semantics, and explicit statistical rules, with no fine-tuning and no access to original data, which enables privacy-preserving, on-demand generation of synthetic data. A multi-level validation framework covering distributional similarity, structural comparison, and cross-domain classification is used to assess the fidelity and utility of the synthetic data. On the AWID3 benchmark, gradient-boosting classifiers trained solely on synthetic data achieve F1-scores of up to 0.956 when evaluated on real traffic.

📝 Abstract
Realistic, large-scale, and well-labeled cybersecurity datasets are essential for training and evaluating Intrusion Detection Systems (IDS). However, they remain difficult to obtain due to privacy constraints, data sensitivity, and the cost of building controlled collection environments such as testbeds and cyber ranges. This paper investigates whether Large Language Models (LLMs) can operate as controlled knowledge-to-data engines for generating structured synthetic network traffic datasets suitable for IDS research. We propose a methodology that combines protocol documentation, attack semantics, and explicit statistical rules to condition LLMs without fine-tuning or access to raw samples. Using the AWID3 IEEE 802.11 benchmark as a demanding case study, we generate labeled datasets with four state-of-the-art LLMs and assess fidelity through a multi-level validation framework including global similarity metrics, per-feature distribution testing, structural comparison, and cross-domain classification. Results show that, under explicit constraints, LLM-generated datasets can closely approximate the statistical and structural characteristics of real network traffic, enabling gradient-boosting classifiers to achieve F1-scores up to 0.956 when evaluated on real samples. Overall, the findings suggest that constrained LLM-driven generation can facilitate on-demand IDS experimentation, providing a testbed-free, privacy-preserving alternative that overcomes the traditional bottlenecks of physical traffic collection and manual labeling.
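
As a rough illustration of two of the validation levels named in the abstract, the sketch below computes a per-feature distribution gap and an F1-score on real samples. It is a minimal example under stated assumptions, not the authors' pipeline: the abstract does not say which statistical test is used, so the two-sample Kolmogorov-Smirnov statistic is an assumption, and the function names, feature values, labels, and predictions are all hypothetical. No gradient-boosting model is trained here; the predictions stand in for the output of a classifier fitted only on synthetic data.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The supremum of the gap is attained at one of the observed sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for one numeric feature (for example a frame length); not taken from AWID3.
real_feature = [60, 62, 64, 70, 1500]
synthetic_feature = [61, 63, 64, 72, 1400]
print("KS gap:", ks_statistic(real_feature, synthetic_feature))

# Hypothetical labels of real samples (1 = attack) and predictions from a
# classifier assumed to have been trained only on synthetic data.
real_labels = [1, 1, 0, 0]
predictions = [1, 0, 0, 1]
print("F1:", f1_score(real_labels, predictions))
```

A KS gap near 0 means the synthetic feature follows the real distribution closely, while a gap near 1 means the two barely overlap. In a cross-domain setup of this kind, the F1-score is computed on real traffic only, so it measures how well knowledge learned from synthetic data transfers.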
Problem

Research questions and friction points this paper is trying to address.

Intrusion Detection Systems
synthetic network traffic
Large Language Models
data generation
privacy-preserving
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven synthesis
structured network traffic
testbed-free IDS evaluation
privacy-preserving data generation
knowledge-to-data
Konstantinos E. Kampourakis
Norwegian University of Science and Technology, 2802 Gjøvik, Norway
Vyron Kampourakis
Norwegian University of Science and Technology, 2802 Gjøvik, Norway
Efstratios Chatzoglou
University of the Aegean, 83200 Karlovasi, Greece
Georgios Kambourakis
Professor, Dept. of Information and Communication Systems Eng., University of the Aegean
Network Security
Stefanos Gritzalis
Professor, Lab. of Systems Security, Dept. of Digital Systems, University of Piraeus, Greece
Cybersecurity, Security, Information Security, Privacy, Data Protection