🤖 AI Summary
This paper addresses three critical challenges in graph neural network–based intrusion detection systems (GIDS): poor reproducibility, weak robustness, and inconsistent evaluation. We conduct a systematic benchmark across four network traffic datasets, including a newly introduced large-scale enterprise dataset, and reveal, for the first time, that the reported results of mainstream GIDS are difficult to reproduce, particularly with respect to false positive rates and robustness under adversarial attacks. To address these issues, we propose a standardized evaluation framework that quantitatively analyzes the impact of data scale, graph structural representation, and implementation details on model performance. We further identify three key bottlenecks: data bias, inconsistent feature engineering, and missing training configurations. Our contributions include: (1) the first multi-dimensional reproducibility benchmark specifically designed for GIDS; (2) a formal adversarial robustness evaluation protocol; and (3) a practical guideline for developing reproducible GIDS.
📝 Abstract
Network Intrusion Detection Systems (NIDS) are vital for ensuring enterprise security. Recently, Graph-based NIDS (GIDS) have attracted considerable attention because of their capability to effectively capture the complex relationships within the graph structures of data communications. Despite their promise, the reproducibility and replicability of these GIDS remain largely unexplored, posing challenges for developing reliable and robust detection systems. This study bridges this gap by designing a systematic approach to evaluate state-of-the-art GIDS, which includes critically assessing, extending, and clarifying the findings of these systems. We further assess the robustness of GIDS under adversarial attacks. Evaluations were conducted on three public datasets as well as a newly collected large-scale enterprise dataset. Our findings reveal significant performance discrepancies, highlighting challenges related to dataset scale, model inputs, and implementation settings. We demonstrate difficulties in reproducing and replicating results, particularly concerning false positive rates and robustness against adversarial attacks. This work provides valuable insights and recommendations for future research, emphasizing the importance of rigorous reproduction and replication studies in developing robust and generalizable GIDS solutions.