Generating Synthetic Malware Samples Using Generative AI

📅 2026-04-23

🤖 AI Summary

This study tackles the poor detection performance on minority-class malware that results from scarce and imbalanced datasets by proposing a data augmentation approach based on generative AI. The method first converts malware binaries into mnemonic opcode sequences and uses natural language processing techniques to model their contextual semantics. It then trains three generative models (a GAN, a WGAN-GP, and a modified diffusion model) to produce synthetic samples. Experiments show that augmenting the training data with diffusion-generated samples improves classification of minority classes by up to 60% on average and raises overall malware classification performance to 96%, an 8% improvement. These findings support the effectiveness of generative models for malware data augmentation.

📝 Abstract

Malware attacks have a significant negative impact on organizations of all scales. Recently, malware researchers have increasingly turned to machine learning techniques to combat the sophisticated obfuscation methods used in malware. However, collecting a diverse set of malware samples with various obfuscation techniques is challenging and often takes years, especially for newly developed malware. This issue is compounded by a well-known limitation of machine learning models: their poor performance when training data is scarce. In this paper, we propose a new system for generating synthetic malware samples to augment imbalanced malware datasets. Our approach decomposes malware binary samples into mnemonic opcode sequences and uses natural language processing to extract the contextual meaning behind malware opcode features. This aids the learning of the generative AI (GenAI) models employed in this paper: Generative Adversarial Networks (GAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP), and a modified Diffusion model. The experimental results show that augmenting training data with Diffusion-based synthetic data significantly improves classification performance for minority classes by up to 60% on average. This enhancement ultimately leads to an overall malware classification performance of 96%, an 8% improvement. These findings demonstrate the high quality and fidelity of the synthetic data, its robustness, and its potential applications in malware analysis. Specifically, synthetic malware data proves effective in improving the classification of minority malware classes and detection rates, even when the amount of known malware data is very small.
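To make the preprocessing idea concrete, the following is a minimal sketch of how disassembled instructions might be reduced to mnemonic opcode sequences, encoded as integer tokens, and checked for class imbalance. It is not the authors' pipeline: the function names, the padding scheme, the disassembly text format, and the family labels ("trojan", "worm") are all assumptions made for illustration, and the step that actually trains a GAN, WGAN-GP, or diffusion model on these tokens is not shown.

```python
from collections import Counter


def extract_mnemonics(disassembly):
    """Keep only the mnemonic (first token) of each instruction line.

    Assumes one instruction per line, e.g. "mov ebp, esp" (hypothetical format).
    """
    mnemonics = []
    for line in disassembly.splitlines():
        tokens = line.split()
        if tokens:
            mnemonics.append(tokens[0].lower())
    return mnemonics


def build_vocab(sequences):
    """Map each distinct mnemonic to an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for seq in sequences:
        for mnemonic in seq:
            vocab.setdefault(mnemonic, len(vocab))
    return vocab


def encode(seq, vocab, length):
    """Truncate or zero-pad a mnemonic sequence to a fixed length of token ids."""
    ids = [vocab[m] for m in seq[:length]]
    return ids + [0] * (length - len(ids))


def synthetic_needed(labels):
    """Number of synthetic samples each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


# Illustrative inputs only; these are not samples from the paper's dataset.
sample_a = extract_mnemonics("push ebp\nmov ebp, esp\nsub esp, 0x10\ncall 0x401000\nret")
sample_b = extract_mnemonics("xor eax, eax\nmov ebx, eax\njmp 0x401020")
vocab = build_vocab([sample_a, sample_b])
encoded_b = encode(sample_b, vocab, length=5)
deficit = synthetic_needed(["trojan"] * 5 + ["worm"] * 2)
```

Here `deficit` reports how many generated samples each family would need to reach parity with the largest class. Balancing to full parity is one possible choice; the abstract does not state what augmentation ratio was used.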

Problem

Research questions and friction points this paper is trying to address.

malware

data scarcity

imbalanced dataset

obfuscation

synthetic data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic Malware Generation

Generative AI

Diffusion Model