🤖 AI Summary
This study addresses the degradation of Android malware detection models under concept drift, adversarial behavior, and heterogeneous feature modalities. The authors construct McNdroid, a large-scale longitudinal multimodal benchmark spanning 2013–2025 (excluding 2015) that, to the authors' knowledge, is the largest of its kind. Each application is represented by three aligned modalities: static (manifest and smali code), dynamic (sandbox behavioral traces), and graph-structured (function-call graphs). Using temporally separated train-test splits, the work evaluates standard ML and deep-learning detectors across increasing time gaps and finds that cross-modal agreement declines over time. It also analyzes modality-specific drift, the evolution of malware families, and temporal changes in model explanations. The experiments show clear temporal degradation in detection performance, while multimodal fusion outperforms the best single modality across long-term temporal gaps.
📝 Abstract
Machine learning (ML) in real-world systems must contend with concept drift, adversarial actors, and a spectrum of potential features with varying costs and benefits. Malware naturally exhibits all of these complexities, but for the same reason, it is challenging to curate and organize data to study these factors. We present McNdroid, to our knowledge the largest longitudinal multimodal Android malware benchmark for malware detection and drift analysis. McNdroid spans 2013–2025, excluding 2015, and represents each application with three aligned modalities: static features from manifests and smali code, dynamic behavioral features from sandbox execution, and graph-based features from function-call graphs. Using temporally separated splits, we evaluate standard ML and deep-learning detectors across increasing train-test time gaps. Results show clear temporal degradation, while multimodal fusion outperforms the best single modality across long-term temporal gaps. Cross-modal agreement also declines over time, suggesting that drift affects both individual feature spaces and the consistency among modalities. We further analyze modality-specific drift, malware-family evolution, and temporal changes in model explanations. We publicly release McNdroid, benchmark splits, and code to support reproducible research on temporal generalization and robust multimodal learning in security-critical, non-stationary settings.
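The two evaluation ideas in the abstract, temporally separated splits with a growing train-test gap and cross-modal agreement, can be illustrated with a minimal sketch. This is not the released McNdroid code: the sample layout (a dict with a `year` key), the function names `temporal_splits` and `agreement_rate`, and the per-year granularity are all assumptions made for illustration. Agreement is taken here to be the fraction of samples on which two modality-specific detectors predict the same label, which may differ from the paper's definition.

```python
from collections import defaultdict

# Assumption: year-level timestamps; the benchmark description omits 2015.
EXCLUDED_YEARS = {2015}


def temporal_splits(samples, train_year, max_gap):
    """Yield (gap, train, test) tuples for increasing train-test time gaps.

    Training data comes only from `train_year`; test data comes only from
    `train_year + gap`, so no future sample leaks into training.
    """
    by_year = defaultdict(list)
    for sample in samples:
        by_year[sample["year"]].append(sample)

    train = by_year.get(train_year, [])
    for gap in range(1, max_gap + 1):
        test_year = train_year + gap
        # Skip years that are excluded or simply absent from the data.
        if test_year in EXCLUDED_YEARS or test_year not in by_year:
            continue
        yield gap, train, by_year[test_year]


def agreement_rate(preds_a, preds_b):
    """Fraction of samples on which two detectors predict the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        return float("nan")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


if __name__ == "__main__":
    # Hypothetical toy samples, one per year.
    toy = [{"year": y, "label": y % 2} for y in (2013, 2014, 2016)]
    for gap, train, test in temporal_splits(toy, train_year=2013, max_gap=3):
        print(gap, len(train), len(test))
    # Hypothetical predictions from a static and a dynamic detector.
    print(agreement_rate([1, 0, 1, 1], [1, 1, 1, 0]))
```

Computing `agreement_rate` separately for each gap returned by `temporal_splits` would give one agreement value per time gap, which is the kind of curve the abstract's claim about declining cross-modal agreement refers to.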