MalMixer: Few-Shot Malware Classification with Retrieval-Augmented Semi-Supervised Learning

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of labeled samples—particularly for emerging malware variants—and the heavy reliance on costly manual reverse engineering in fine-grained malware family classification, this paper proposes a domain-knowledge-aware few-shot classification framework. Methodologically, it integrates retrieval-augmented representation learning, semi-supervised learning, graph neural networks, and contrastive learning, and introduces MalMixer, a novel domain-driven feature mixing mechanism that jointly enhances behavioral semantics and structural characteristics of malware. Evaluated on multiple real-world benchmarks, the framework achieves over 92% classification accuracy using only 3–5 labeled samples per family—substantially outperforming existing approaches. It establishes a new state-of-the-art (SOTA) for few-shot malware classification and significantly reduces dependence on manual reverse-engineering efforts.

📝 Abstract
Recent growth and proliferation of malware has tested practitioners' ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a novel domain-knowledge-aware technique for augmenting malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware feature augmentation methods and highlights the capabilities of similar semi-supervised classifiers in addressing malware classification issues.

Problem

Research questions and friction points this paper is trying to address.

Classifying new malware samples with limited labeled data
Reducing reliance on labor-intensive reverse engineering for training
Improving few-shot malware family classification accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented semi-supervised learning method
Domain-knowledge-aware feature augmentation technique
Few-shot malware classification with high accuracy
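
The listing names retrieval-augmented learning and feature augmentation but does not describe how they work. The sketch below is only a generic illustration of the idea, not the authors' method: it retrieves the feature vectors most similar to a query sample and blends the query with their mean. The function names, the cosine-similarity retrieval, and the `k` and `alpha` parameters are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_and_mix(query, pool, k=2, alpha=0.5):
    """Hypothetical helper: blend `query` with the mean of its k most similar pool vectors.

    alpha is the weight kept on the original query features;
    (1 - alpha) goes to the retrieved neighbors' mean.
    """
    # Rank pool indices by similarity to the query, most similar first.
    ranked = sorted(range(len(pool)), key=lambda i: cosine(query, pool[i]), reverse=True)
    top_k = ranked[:k]
    if not top_k:
        # Nothing to retrieve: return the query unchanged.
        return list(query), []
    # Element-wise mean of the retrieved neighbors.
    neighbor_mean = [
        sum(pool[i][d] for i in top_k) / len(top_k) for d in range(len(query))
    ]
    # Linear interpolation between the query and the neighbor mean.
    mixed = [alpha * q + (1 - alpha) * m for q, m in zip(query, neighbor_mean)]
    return mixed, top_k


if __name__ == "__main__":
    pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy feature vectors
    mixed, neighbors = retrieve_and_mix([1.0, 0.0], pool, k=2, alpha=0.5)
    print(neighbors, mixed)
```

In a semi-supervised setting, an augmented vector like `mixed` could serve as an extra training view of a scarce labeled sample. Whether MalMixer retrieves or mixes features this way is not stated in this listing.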
Eric Li
Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN, USA; Department of Computer Science, Stanford University, Palo Alto, CA, USA
Yifan Zhang
Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN, USA
Yu Huang
Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN, USA
Kevin Leach
Vanderbilt University
Artificial Intelligence, Software Engineering, Security