🤖 AI Summary
This study addresses the degradation in adversarial robustness that Android malware detection systems suffer under long-term deployment because of temporal concept drift. Through a longitudinal analysis spanning more than a decade and combining data from real devices and emulators, the authors evaluate robustness decay across varying training–testing time configurations using both static and dynamic features. They introduce three temporally aware metrics (RobustDrop, ΔASR, and the Adversarial Amplification Factor, AAF) to systematically quantify, for the first time, the relationship between temporal drift and adversarial robustness. They also establish three evaluation protocols that mirror practical deployment scenarios: same-year training and testing, cross-year deployment without model updates, and expanding-window retraining. Experimental results show that widening the temporal gap between training and testing reduces both clean and adversarial accuracy, while attack success rises in a configuration-dependent way, most notably under FGSM perturbations and static features. Expanding-window retraining partially mitigates the loss but does not fully restore robustness.
📝 Abstract
We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static and dynamic feature representations extracted from emulator and real-device executions. The dataset is organized into yearly slices and evaluated under three deployment protocols that emulate realistic learning scenarios: (1) same-year training and testing, (2) cross-year deployment without model updates, and (3) expanding-window retraining with cumulative historical data. Across multiple classifier families, adversarial examples are generated using FGSM and SPSA under feasibility constraints. We measure clean performance, Adversarial Accuracy (AA), Attack Success Rate (ASR), and introduce temporal linkage metrics -- RobustDrop, $\Delta$ASR, and Adversarial Amplification Factor (AAF) -- to quantify the relationship between distribution shift and robustness degradation.
Results show that temporal separation is associated with reduced adversarial robustness under the evaluated transfer-based feature-space setting. As the train-test gap increases, clean accuracy and adversarial accuracy decline, while attack success exhibits configuration-dependent increases, particularly under FGSM perturbations and static features. Expanding-window retraining mitigates, but does not eliminate, robustness loss under continued distributional evolution. These findings indicate that temporal drift should be considered when assessing the long-term robustness of intelligent detection systems under evolving data distributions and highlight the need for drift-aware robustness assessment frameworks in long-lived adversarial environments.
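The three deployment protocols described above can be pictured as different ways of pairing training years with a test year. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and it assumes that cross-year deployment only tests on years later than the training year and that expanding-window retraining tests on the year immediately following the cumulative training window. The abstract does not specify either detail.

```python
from typing import List, Tuple

# A split pairs a list of training years with a single test year.
Split = Tuple[List[int], int]


def same_year_splits(years: List[int]) -> List[Split]:
    """Protocol 1: train and test within the same yearly slice.

    A held-out partition inside each year is still needed; it is not shown here.
    """
    return [([y], y) for y in sorted(years)]


def cross_year_splits(years: List[int]) -> List[Split]:
    """Protocol 2: train on one year, deploy unchanged on every later year (assumed)."""
    ys = sorted(years)
    return [([train], test) for i, train in enumerate(ys) for test in ys[i + 1:]]


def expanding_window_splits(years: List[int]) -> List[Split]:
    """Protocol 3: train on all years seen so far, test on the next year (assumed)."""
    ys = sorted(years)
    return [(ys[:i], ys[i]) for i in range(1, len(ys))]


if __name__ == "__main__":
    # Illustrative years only; these are not taken from the paper's dataset.
    demo_years = [2012, 2013, 2014]
    for name, make in [
        ("same-year", same_year_splits),
        ("cross-year", cross_year_splits),
        ("expanding-window", expanding_window_splits),
    ]:
        print(name, make(demo_years))
```

Under these assumptions, the train-test gap for a cross-year split is simply the test year minus the training year, which is the quantity against which clean and adversarial accuracy would be tracked.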