🤖 AI Summary
Conventional closed-set classifiers fail when new Android malware families keep emerging, so this paper proposes an open-set recognition framework based on permission features. It is the first to adapt the MaxLogit method, originally developed for computer vision, to Android malware analysis. The approach pairs sparse, high-dimensional permission features extracted from Android manifest files with gradient-boosted decision trees (GBDT), so that a single model performs fine-grained classification of known families and reliably detects unknown ones. The method adds little computational overhead and scales well. Evaluation on multiple public and private datasets shows improvements in unknown-family detection rates of +12.7% to +28.3%, with the false positive rate kept below 1.5%. The framework has been integrated into an enterprise-grade mobile security protection system and deployed in production.
📝 Abstract
Malware programs are grouped into families based on their penetration technique, source code, and other characteristics. Classifying malware into the correct family is essential for building effective defenses against cyber threats. Machine learning models have great potential for malware detection on mobile devices, as malware families can be recognized by classifying permission data extracted from Android manifest files. Still, the classification task is challenging because permission data is high-dimensional and training samples are scarce. Moreover, the steady emergence of new malware families makes it impossible to acquire a training set that covers every malware class. In this work, we present a malware classification system that, on top of classifying known malware, detects samples from new, previously unseen families. Specifically, we combine MaxLogit, an open-set recognition technique developed within the computer vision community, with a tree-based Gradient Boosting classifier, which is particularly effective at classifying high-dimensional data. Our solution is practical, as it can be seamlessly employed in a standard classification workflow, and efficient, as it adds minimal computational overhead. Experiments on public and proprietary datasets demonstrate the potential of our solution, which has been deployed in a business environment.
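The MaxLogit idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the classifier exposes one raw score (logit) per known family for each sample, for example the raw per-class output of a gradient boosting model, and it treats a sample as unknown when its largest raw score falls below a threshold. The names `UNKNOWN`, `calibrate_threshold`, `predict_open_set`, and the `keep_rate` parameter are hypothetical, and choosing the threshold as a quantile of validation scores is an assumption, not something stated in the abstract.

```python
from typing import List, Sequence

UNKNOWN = -1  # hypothetical label meaning "not one of the known families"


def calibrate_threshold(known_scores: Sequence[Sequence[float]],
                        keep_rate: float = 0.95) -> float:
    """Pick a threshold from validation samples of known families.

    Assumption: the threshold is the (1 - keep_rate) quantile of the
    maximum raw scores, so roughly keep_rate of known samples are accepted.
    """
    max_scores = sorted(max(row) for row in known_scores)
    cut = int((1.0 - keep_rate) * len(max_scores))
    return max_scores[cut]


def predict_open_set(scores: Sequence[float], threshold: float) -> int:
    """Return the index of the best-scoring family, or UNKNOWN.

    MaxLogit uses the largest raw score itself (no softmax) as the
    confidence: a low maximum suggests the sample fits no known family.
    """
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best if scores[best] >= threshold else UNKNOWN


def predict_batch(batch: Sequence[Sequence[float]],
                  threshold: float) -> List[int]:
    """Apply the open-set rule to every row of raw scores."""
    return [predict_open_set(row, threshold) for row in batch]


# Toy raw scores for three known families (illustrative values only).
validation = [
    [4.0, -1.0, -2.0],
    [-1.5, 3.5, -0.5],
    [0.2, -0.3, 5.1],
    [2.9, 0.1, -1.0],
]
threshold = calibrate_threshold(validation, keep_rate=1.0)
labels = predict_batch([[-1.0, 6.0, 0.0], [0.3, 0.1, 0.2]], threshold)
```

Because the rule only inspects scores the classifier already computes, it fits into a standard classification workflow with one extra comparison per sample, which is consistent with the minimal overhead the abstract reports.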