Technical Report: Generating the WEB-IDS23 Dataset

📅 2025-02-06

🤖 AI Summary

Existing NIDS evaluations rely on datasets with coarse-grained labels, small sample sizes, outdated attack types, and little coverage of modern web attacks, which can lead to overfitting that goes unnoticed and to poor generalization. To address these limitations, this paper introduces WEB-IDS23, a dataset designed with web attack detection in mind. It is produced by a modular traffic generator that supports multiple protocols, adds variability through randomization, and generates attacks alongside the corresponding benign traffic, as occurs in real-world scenarios. The resulting dataset contains over 12 million labeled samples with 82 flow-level features and 21 fine-grained labels, and it includes several web attack types (e.g., SQLi, XSS, RCE, path traversal) that are often underrepresented in other datasets. The authors position WEB-IDS23 as a basis for more reliable NIDS evaluation and development.

📝 Abstract

Anomaly-based Network Intrusion Detection Systems (NIDS) require correctly labelled, representative, and diverse datasets for accurate evaluation and development. However, several widely used datasets lack sufficiently fine-grained labels, which, combined with small sample sizes, can lead to overfitting that remains undetected even when evaluating on test data. Additionally, the cybersecurity sector is evolving fast, and new attack mechanisms require the continuous creation of up-to-date datasets. To address these limitations, we developed a modular traffic generator that can simulate a wide variety of benign and malicious traffic. It incorporates multiple protocols, introduces variability through randomization techniques, and can produce attacks alongside the corresponding benign traffic, as occurs in real-world scenarios. Using the traffic generator, we create a dataset containing over 12 million samples with 82 flow-level features and 21 fine-grained labels. Additionally, we include several web attack types that are often underrepresented in other datasets.
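
The generator design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Flow` record, the three module functions, the mixing weights, and the exponential inter-arrival rate are all hypothetical choices made only to show how pluggable modules, randomization, and interleaved benign and malicious traffic might fit together.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Flow:
    start: float      # seconds since the start of the simulation
    protocol: str     # transport protocol of the flow
    dst_port: int
    n_packets: int
    label: str        # fine-grained label: "benign" or a specific attack name


# Hypothetical module interface: given an RNG and a start time, emit one flow.
Module = Callable[[random.Random, float], Flow]


def benign_http(rng: random.Random, t: float) -> Flow:
    return Flow(t, "tcp", 80, rng.randint(4, 40), "benign")


def benign_dns(rng: random.Random, t: float) -> Flow:
    return Flow(t, "udp", 53, rng.randint(1, 4), "benign")


def sql_injection(rng: random.Random, t: float) -> Flow:
    # Illustrative attack module; the packet range is an arbitrary assumption.
    return Flow(t, "tcp", 80, rng.randint(6, 60), "sql_injection")


def generate(modules: Dict[str, Module], weights: Dict[str, float],
             n_flows: int, seed: int = 0) -> List[Flow]:
    """Interleave flows from all modules with randomized timing."""
    rng = random.Random(seed)              # seeded for reproducible runs
    names = list(modules)
    w = [weights[name] for name in names]
    t = 0.0
    flows = []
    for _ in range(n_flows):
        t += rng.expovariate(10.0)         # random inter-arrival gap (assumed rate)
        name = rng.choices(names, weights=w, k=1)[0]
        flows.append(modules[name](rng, t))
    return flows


MODULES = {"http": benign_http, "dns": benign_dns, "sqli": sql_injection}
WEIGHTS = {"http": 0.6, "dns": 0.3, "sqli": 0.1}
flows = generate(MODULES, WEIGHTS, n_flows=1000, seed=42)
```

Because every module shares one interface, adding a new protocol or attack type in this sketch only means registering another function, and each emitted flow carries its own fine-grained label from the moment it is created.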

Problem

Research questions and friction points this paper is trying to address.

Develop dataset for accurate NIDS evaluation

Address overfitting with fine-grained labels

Simulate diverse benign and malicious traffic
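
Why fine-grained labels matter for evaluation can be shown with a small, made-up example. The labels (`dos`, `xss`), the counts, and the helper functions below are illustrative assumptions, not data or code from the paper: a binary detector that misses one attack type entirely still scores well on coarse benign/attack accuracy, and only the per-label breakdown exposes the gap.

```python
from collections import defaultdict
from typing import Dict, List


def binary_accuracy(labels: List[str], flagged: List[bool]) -> float:
    """Accuracy when ground truth is collapsed to benign vs. attack."""
    hits = sum(f == (lab != "benign") for lab, f in zip(labels, flagged))
    return hits / len(labels)


def detection_rate_by_label(labels: List[str], flagged: List[bool]) -> Dict[str, float]:
    """Fraction of correct verdicts for each fine-grained label."""
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for lab, f in zip(labels, flagged):
        total[lab] += 1
        if f == (lab != "benign"):
            correct[lab] += 1
    return {lab: correct[lab] / total[lab] for lab in total}


# Toy ground truth: 90 benign flows, 8 "dos" flows, 2 "xss" flows.
labels = ["benign"] * 90 + ["dos"] * 8 + ["xss"] * 2
# Toy detector: flags every "dos" flow, misses every "xss" flow.
flagged = [False] * 90 + [True] * 8 + [False] * 2

overall = binary_accuracy(labels, flagged)
per_label = detection_rate_by_label(labels, flagged)
```

Here `overall` is 0.98, while `per_label["xss"]` is 0.0, a failure that a dataset with only a coarse "attack" label could not reveal.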

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular traffic generator

Simulates diverse traffic

Creates fine-grained labeled dataset
