BERTDetect: A Neural Topic Modelling Approach for Android Malware Detection

📅 2025-03-23
🤖 AI Summary
Android malware evolves continuously, and attackers quickly replace old malware with new variants, which limits what source-code analysis can catch on its own. App store descriptions offer a complementary signal about what an app claims to do. The paper proposes BERTDetect, which applies BERTopic neural topic modelling to app descriptions to capture their latent topics. Earlier work along these lines clustered apps by description using LDA and k-means and flagged apps whose API usage was an outlier within its cluster as malware; those techniques miss the nuanced semantic structure of descriptions. The topic clusters produced by BERTDetect are more coherent than those of previous methods and represent app functionality well, and BERTDetect outperforms the baselines with a ~10% relative improvement in F1 score. The approach is positioned as a complement to source-code analysis, not a replacement for it.

📝 Abstract
Web access today occurs predominantly through mobile devices, with Android representing a significant share of the mobile device market. This widespread usage makes Android a prime target for malicious attacks. Despite efforts to combat these attacks through tools like Google Play Protect and antivirus software, new and evolved malware continues to infiltrate Android devices. Source code analysis is effective but limited, as attackers quickly abandon old malware for new variants to evade detection. Therefore, there is a need for alternative methods that complement source code analysis. Prior research investigated clustering applications based on their descriptions and identified outliers in these clusters by API usage as malware. However, these works often used traditional techniques such as Latent Dirichlet Allocation (LDA) and k-means clustering, which do not capture the nuanced semantic structures present in app descriptions. To this end, we propose BERTDetect, which leverages BERTopic neural topic modelling to effectively capture the latent topics in app descriptions. The resulting topic clusters are more coherent than those of previous methods and represent the app functionalities well. Our results demonstrate that BERTDetect outperforms other baselines, achieving ~10% relative improvement in F1 score.
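The abstract describes a two-stage idea: group apps by what their descriptions say, then treat apps whose API usage is unusual for their group as suspicious. The abstract does not say how outliers are scored, so the following is only a minimal sketch under stated assumptions. The topic ids stand in for the output of a topic model such as BERTopic, the function name `api_outlier_scores`, the app records, and the rarity-based score are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def api_outlier_scores(apps):
    """Score each app by how unusual its APIs are within its topic cluster.

    apps: list of dicts with keys 'name' (str), 'topic' (int), 'apis' (set of str).
    Returns {name: score}, where score is the mean rarity of the app's APIs
    among apps sharing its topic (0.0 = every API is used by all cluster members).
    This scoring rule is an illustrative assumption, not the paper's method.
    """
    by_topic = defaultdict(list)
    for app in apps:
        by_topic[app["topic"]].append(app)

    scores = {}
    for members in by_topic.values():
        # How many apps in this cluster use each API.
        counts = Counter(api for app in members for api in app["apis"])
        n = len(members)
        for app in members:
            if not app["apis"]:
                scores[app["name"]] = 0.0
                continue
            rarity = [1.0 - counts[api] / n for api in app["apis"]]
            scores[app["name"]] = sum(rarity) / len(rarity)
    return scores


# Hypothetical cluster of four "flashlight" apps; one also requests SMS and contacts.
apps = [
    {"name": "torch_a", "topic": 0, "apis": {"CAMERA"}},
    {"name": "torch_b", "topic": 0, "apis": {"CAMERA"}},
    {"name": "torch_c", "topic": 0, "apis": {"CAMERA"}},
    {"name": "torch_d", "topic": 0, "apis": {"CAMERA", "SEND_SMS", "READ_CONTACTS"}},
]
scores = api_outlier_scores(apps)
suspects = [name for name, s in scores.items() if s > 0.3]  # threshold is arbitrary
```

In this toy cluster only `torch_d` exceeds the threshold, because its SMS and contacts APIs are rare among apps that describe themselves as flashlights. A real pipeline would need a principled score and threshold, which this sketch does not attempt.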
Problem

Research questions and friction points this paper is trying to address.

Detect Android malware using neural topic modeling
Improve semantic analysis of app descriptions
Enhance detection accuracy over traditional methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

BERTopic neural topic modelling for Android malware detection
Improved topic cluster coherence over traditional methods
~10% relative F1 score improvement compared to baselines
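The abstract reports the F1 gain as a relative improvement, which is easy to confuse with an absolute gain in F1 points. As a small illustration of the difference, the sketch below computes both; the function name and the two F1 values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_f1, new_f1):
    """Relative gain of new_f1 over baseline_f1, as a fraction of the baseline."""
    return (new_f1 - baseline_f1) / baseline_f1


# Hypothetical scores chosen only to illustrate the arithmetic.
baseline_f1 = 0.50
new_f1 = 0.55

absolute_gain = new_f1 - baseline_f1                        # about 0.05 F1 points
relative_gain = relative_improvement(baseline_f1, new_f1)   # about 0.10, i.e. ~10%
```

With these made-up numbers, a ~10% relative improvement corresponds to only about 5 F1 points in absolute terms, so the baseline score matters when interpreting the headline figure.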