🤖 AI Summary
To address the scarcity of labeled data and poor model generalization in malicious API request detection, this paper proposes a domain-aware synthesis framework integrating RoBERTa with conditional Generative Adversarial Networks (cGANs). RoBERTa is embedded into the generator to jointly model the contextual semantics and security logic of API requests, enabling controllable generation of realistic synthetic traffic. By combining Transformer-based feature encoding with adversarial training, the method aims to improve both the diversity and the semantic plausibility of the generated data. Evaluated on the CSIC 2010 and ATRDF 2023 benchmarks, it achieves F1-score improvements of up to 4.94% and 21.10%, respectively, and is compared against a previous data augmentation technique to assess the value of domain-specific synthesis. The work targets low-resource web security detection through synthetic data generation for API security analytics.
📝 Abstract
Web applications and APIs face constant threats from malicious actors seeking to exploit vulnerabilities for illicit gain. Defending against these threats requires anomaly detection systems that can identify a variety of malicious behaviors. However, a significant challenge in this area is the limited availability of training data. Existing datasets often do not provide sufficient coverage of the diverse API structures, parameter formats, and usage patterns encountered in real-world scenarios. As a result, models trained on these datasets often struggle to generalize and may fail to detect less common or emerging attack vectors. Improving detection accuracy and robustness requires larger and more representative datasets that capture the true variability of API traffic. To address this, we introduce a GAN-inspired learning framework that extends limited API traffic datasets through targeted, domain-aware synthesis. Drawing on techniques from Natural Language Processing (NLP), our approach leverages Transformer-based architectures, particularly RoBERTa, to enhance the contextual representation of API requests and generate realistic synthetic samples aligned with security-specific semantics. We evaluate our framework on two benchmark datasets, CSIC 2010 and ATRDF 2023, and compare it with a previous data augmentation technique to assess the importance of domain-specific synthesis. In addition, we apply our augmented data to various anomaly detection models to evaluate its impact on classification performance. Our method achieves up to a 4.94% increase in F1 score on CSIC 2010 and up to 21.10% on ATRDF 2023. The source code for this work is available at https://github.com/ArielCyber/GAN-API.
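As a rough illustration of how an F1 gain of the kind reported above might be measured, the sketch below computes F1 from confusion counts for a detector trained without and with augmented data. The abstract does not state whether its percentages are absolute or relative, so both are computed. All counts and function names here are hypothetical and are not taken from the paper or its repository.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def absolute_gain_points(baseline: float, augmented: float) -> float:
    """Difference between the two F1 scores, in percentage points."""
    return (augmented - baseline) * 100


def relative_gain_percent(baseline: float, augmented: float) -> float:
    """Improvement expressed as a percentage of the baseline F1."""
    return (augmented - baseline) / baseline * 100


# Hypothetical confusion counts for the "malicious" class (illustrative only).
baseline_f1 = f1_score(tp=80, fp=20, fn=20)    # trained on original data
augmented_f1 = f1_score(tp=90, fp=10, fn=10)   # trained on original + synthetic data

abs_gain = absolute_gain_points(baseline_f1, augmented_f1)
rel_gain = relative_gain_percent(baseline_f1, augmented_f1)

print(f"baseline F1:  {baseline_f1:.2f}")
print(f"augmented F1: {augmented_f1:.2f}")
print(f"absolute gain: {abs_gain:.2f} points")
print(f"relative gain: {rel_gain:.2f} %")
```

The two gain measures can differ noticeably when the baseline F1 is low, which is why it helps to state which one a reported improvement refers to.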