Label-efficient Training Updates for Malware Detection over Time

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges in malware detection: data distribution drift caused by temporal evolution, and the high cost of manual labeling. The authors propose a model-agnostic framework that evaluates active learning and semi-supervised learning strategies, both in isolation and combined, for Android and Windows malware detection. They also introduce a feature-level drift analysis that measures feature stability over time and relates it to detection performance. Experimental results show that the combined techniques achieve detection performance comparable to full-label retraining while reducing annotation effort by up to 90%, which makes the approach practical for long-term deployment.
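The feature-level drift analysis mentioned above can be illustrated with a minimal sketch. The summary does not say which stability measure the authors use, so the metric below (absolute change in how often each binary feature is active between two time windows) is an assumption chosen for simplicity, and the names `feature_frequencies` and `feature_drift` are hypothetical.

```python
from typing import List, Sequence


def feature_frequencies(samples: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of samples in which each binary feature is set."""
    n_features = len(samples[0])
    return [
        sum(row[j] for row in samples) / len(samples)
        for j in range(n_features)
    ]


def feature_drift(
    old_window: Sequence[Sequence[int]],
    new_window: Sequence[Sequence[int]],
) -> List[float]:
    """Absolute change in activation frequency per feature (0.0 = stable).

    This is an illustrative stand-in for a stability measure, not the
    metric used in the paper.
    """
    old_freq = feature_frequencies(old_window)
    new_freq = feature_frequencies(new_window)
    return [abs(new - old) for old, new in zip(old_freq, new_freq)]


if __name__ == "__main__":
    # Rows are samples, columns are binary features (toy data).
    earlier = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
    later = [[1, 1, 0], [1, 1, 0], [1, 1, 1], [1, 0, 0]]
    print(feature_drift(earlier, later))
```

Features with a score near zero behave the same in both windows. Larger scores flag features whose usage shifted, which is the kind of signal one could then compare against changes in detector performance.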
📝 Abstract
Machine Learning (ML)-based detectors are becoming essential to counter the proliferation of malware. However, common ML algorithms are not designed to cope with the dynamic nature of real-world settings, where both legitimate and malicious software evolve. This distribution drift causes models trained under static assumptions to degrade over time unless they are continuously updated. Regularly retraining these models, however, is expensive, since labeling newly acquired data requires costly manual analysis by security experts. To reduce labeling costs and address distribution drift in malware detection, prior work explored active learning (AL) and semi-supervised learning (SSL) techniques. Yet, existing studies (i) are tightly coupled to specific detector architectures and restricted to a specific malware domain, resulting in non-uniform comparisons; and (ii) lack a consistent methodology for analyzing the distribution drift, despite the critical sensitivity of the malware domain to temporal changes. In this work, we bridge this gap by proposing a model-agnostic framework that evaluates an extensive set of AL and SSL techniques, isolated and combined, for Android and Windows malware detection. We show that these techniques, when combined, can reduce manual annotation costs by up to 90% across both domains while achieving detection performance comparable to full-labeling retraining. We also introduce a methodology for feature-level drift analysis that measures feature stability over time, showing its correlation with detector performance. Overall, our study provides a detailed understanding of how AL and SSL behave under distribution drift and how they can be successfully combined, offering practical insights for the design of effective detectors over time.
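As a rough illustration of how AL and SSL can be combined in one update step, the sketch below pairs uncertainty sampling (send the least certain samples to an expert) with pseudo-labeling (trust the detector's own confident predictions). The abstract only says that an extensive set of techniques is evaluated, so this particular pairing, the function name `plan_update`, and the 0.95 threshold are all assumptions for illustration, not the authors' configuration.

```python
from typing import List, Tuple


def plan_update(
    malware_probs: List[float],
    label_budget: int,
    pseudo_threshold: float = 0.95,
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Split unlabeled samples into expert queries and pseudo-labeled samples.

    malware_probs[i] is the current detector's estimated probability that
    sample i is malware. Names and defaults are illustrative assumptions.
    """
    # Active learning (uncertainty sampling): samples whose score is
    # closest to 0.5 are the ones the detector is least sure about.
    by_uncertainty = sorted(
        range(len(malware_probs)),
        key=lambda i: abs(malware_probs[i] - 0.5),
    )
    to_query = by_uncertainty[:label_budget]
    queried = set(to_query)

    # Semi-supervised learning (pseudo-labeling): among the remaining
    # samples, keep only confident predictions and use them as labels.
    pseudo_labeled = []
    for i, p in enumerate(malware_probs):
        if i in queried:
            continue
        confidence = max(p, 1.0 - p)
        if confidence >= pseudo_threshold:
            pseudo_labeled.append((i, 1 if p >= 0.5 else 0))
    return to_query, pseudo_labeled


if __name__ == "__main__":
    probs = [0.99, 0.52, 0.03, 0.70, 0.45, 0.98]
    queries, pseudo = plan_update(probs, label_budget=2)
    print("ask the analyst about:", queries)
    print("pseudo-labels (index, label):", pseudo)
```

With a budget of two labels, only the two most ambiguous samples reach the analyst, three confident ones are labeled automatically, and the remaining sample is left out of this update. Retraining would then use the expert labels together with the pseudo-labels.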
Problem

Research questions and friction points this paper is trying to address.

malware detection
distribution drift
label efficiency
active learning
semi-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

label-efficient learning
distribution drift
active learning
semi-supervised learning
malware detection
Luca Minnei
Department of Electrical and Electronic Engineering, University of Cagliari, Italy
Cristian Manca
Department of Electrical and Electronic Engineering, University of Cagliari, Italy
Giorgio Piras
Department of Electrical and Electronic Engineering, University of Cagliari, Italy
Angelo Sotgiu
Assistant Professor, University of Cagliari
Maura Pintor
University of Cagliari
Machine Learning, Adversarial Machine Learning, Computer Security
Daniele Ghiani
Department of Electrical and Electronic Engineering, University of Cagliari, Italy
Davide Maiorca
Associate Professor of Computer Engineering at University of Cagliari, Italy
Computer Security, Pattern Recognition, Adversarial Machine Learning, PDF, Android
Giorgio Giacinto
Department of Electrical and Electronic Engineering, University of Cagliari, Italy
Battista Biggio
Professor of Computer Engineering, University of Cagliari, Italy
Adversarial Machine Learning, AI Security, Machine Learning, Computer Security