Scaling Up: Revisiting Mining Android Sandboxes at Scale for Malware Classification

📅 2025-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines whether the Mining Android Sandbox (MAS) approach generalizes to large-scale settings. We reproduce and evaluate MAS on a dataset of 4,076 app pairs spanning 12 malware families, roughly a 40× scale-up with greater family diversity than the 102 pairs used in the original study. On this dataset, the F1-score of MAS drops from the originally reported 0.90 to 0.54. MAS fails entirely on specific families (e.g., BankBot, DroidKungFu), which partly accounts for the decline. These findings expose the limits of relying on behavior-based features alone for large-scale Android malware classification and give empirical support for complementing MAS with other detection techniques.

📝 Abstract

The widespread use of smartphones in daily life has raised concerns about privacy and security among researchers and practitioners. Privacy issues are highly prevalent in mobile applications, particularly those targeting Android, the most popular mobile operating system. For this reason, several techniques have been proposed to identify malicious behavior in Android applications, including the Mining Android Sandbox approach (MAS approach), which aims to identify malicious behavior in repackaged Android applications (apps). However, previous empirical studies evaluated the MAS approach using a small dataset of only 102 pairs of original and repackaged apps. This limitation raises questions about the external validity of their findings and about whether the MAS approach generalizes to larger datasets. To address these concerns, this paper presents the results of a replication study that evaluates how well the MAS approach classifies malware from different families. Unlike previous studies, our research employs a dataset that is an order of magnitude larger, comprising 4,076 pairs of apps and covering a more diverse range of Android malware families. Surprisingly, our findings indicate that the MAS approach performs poorly at identifying malware, with the F1-score decreasing from 0.90 on the small dataset used in the previous studies to 0.54 on our larger dataset. Upon closer examination, we discovered that certain malware families partially account for the low accuracy of the MAS approach, which fails to correctly classify a repackaged version of an app as malware. Our findings highlight the limitations of the MAS approach, particularly when scaled, and underscore the importance of complementing it with other techniques to detect a broader range of malware effectively.
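
To make the reported metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common reading of the MAS idea: the sensitive API calls observed while exploring the original app form a "sandbox", and a repackaged version is flagged as malware when it makes a sensitive call outside that sandbox. The `AppPair` type, its field names, and the API strings in the usage note are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppPair:
    """One original/repackaged pair (hypothetical structure for illustration)."""
    original_calls: frozenset    # sensitive APIs observed in the original app
    repackaged_calls: frozenset  # sensitive APIs observed in the repackaged app
    is_malware: bool             # ground-truth label for the repackaged app


def flags_as_malware(pair: AppPair) -> bool:
    # Assumed decision rule: any sensitive call not seen in the original
    # app (i.e. outside the mined sandbox) marks the repackaged app as malware.
    return bool(pair.repackaged_calls - pair.original_calls)


def f1_score(pairs) -> float:
    # Count true positives, false positives, and false negatives,
    # then combine precision and recall into their harmonic mean.
    tp = fp = fn = 0
    for pair in pairs:
        predicted = flags_as_malware(pair)
        if predicted and pair.is_malware:
            tp += 1
        elif predicted and not pair.is_malware:
            fp += 1
        elif not predicted and pair.is_malware:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under this rule, a malicious repackaged app that introduces no new sensitive calls (for example `AppPair(frozenset({"getDeviceId"}), frozenset({"getDeviceId"}), True)`) counts as a false negative. A family that mostly behaves this way lowers recall, and with it the F1-score, which is the kind of effect the abstract attributes to certain malware families.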

Problem

Research questions and friction points this paper is trying to address.

Evaluating MAS approach for malware classification on larger datasets

Assessing MAS performance across diverse Android malware families

Identifying limitations of MAS in scaled repackaged app detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Larger dataset with 4,076 app pairs

Measured the decline in MAS performance at scale (F1-score from 0.90 to 0.54)

Highlighted the need for complementary malware detection techniques

🔎 Similar Papers

Reassessing feature-based Android malware detection in a contemporary context

2023-01-30 · Citations: 5

Revisiting Static Feature-Based Android Malware Detection

2024-09-11 · arXiv.org · Citations: 1
