BERTDetect: A Neural Topic Modelling Approach for Android Malware Detection

📅 2025-03-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Android malware evolves continuously, which limits the effectiveness of traditional source-code analysis against novel variants; meanwhile, app store descriptions contain underexploited semantic cues about app functionality. To address this, we propose BERTDetect, a BERTopic-based neural topic modeling approach for Android malware detection that operates directly on application description text. The method uses contextual semantic embeddings and unsupervised clustering to automatically discover coherent topic clusters that represent app functionality well. Compared to baselines that rely on LDA and k-means, it yields more coherent topics and an approximately 10% relative gain in F1 score. The key contribution is applying BERTopic to Android app description text as a complement to source-code-centric methods, which struggle with new variants. Because it works on descriptions rather than code, the approach offers a lightweight screening signal that does not require reverse engineering.

📝 Abstract

Web access today occurs predominantly through mobile devices, with Android representing a significant share of the mobile device market. This widespread usage makes Android a prime target for malicious attacks. Despite efforts to combat these attacks through tools like Google Play Protect and antivirus software, new and evolved malware continues to infiltrate Android devices. Source code analysis is effective but limited, as attackers quickly abandon old malware for new variants to evade detection. Therefore, there is a need for alternative methods that complement source code analysis. Prior research investigated clustering applications based on their descriptions and flagged apps whose API usage made them outliers within their cluster as malware. However, these works often used traditional techniques such as Latent Dirichlet Allocation (LDA) and k-means clustering, which do not capture the nuanced semantic structures present in app descriptions. To this end, in this paper, we propose BERTDetect, which leverages BERTopic neural topic modelling to effectively capture the latent topics in app descriptions. The resulting topic clusters are more coherent than those produced by previous methods and represent the app functionalities well. Our results demonstrate that BERTDetect outperforms other baselines, achieving ~10% relative improvement in F1 score.
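The idea of flagging apps whose API usage is unusual for their description cluster can be illustrated with a small sketch. This is not the paper's implementation: the function name `flag_api_outliers`, the `min_support` threshold, the app names, and the API labels are all hypothetical, and the topic labels are assumed to have been produced already by some topic model run on the app descriptions.

```python
from collections import Counter, defaultdict


def flag_api_outliers(topics, api_usage, min_support=0.5):
    """Return {app: [rare APIs]} for apps using APIs uncommon in their topic.

    topics:      dict mapping app id -> topic label (assumed precomputed)
    api_usage:   dict mapping app id -> set of API names used by the app
    min_support: hypothetical threshold; an API is "rare" in a topic when
                 fewer than this fraction of the topic's apps use it
    """
    # Group apps by the topic their description was assigned to.
    clusters = defaultdict(list)
    for app, topic in topics.items():
        clusters[topic].append(app)

    flagged = {}
    for apps in clusters.values():
        # Count how many apps in this cluster use each API.
        counts = Counter(api for app in apps for api in api_usage[app])
        for app in apps:
            rare = sorted(
                api for api in api_usage[app]
                if counts[api] / len(apps) < min_support
            )
            if rare:
                flagged[app] = rare
    return flagged


# Toy data (entirely made up): topic 0 = weather apps, topic 1 = chat apps.
topics = {
    "weather_a": 0, "weather_b": 0, "weather_c": 0, "weather_d": 0,
    "chat_a": 1, "chat_b": 1, "chat_c": 1,
}
api_usage = {
    "weather_a": {"INTERNET", "LOCATION"},
    "weather_b": {"INTERNET", "LOCATION"},
    "weather_c": {"INTERNET", "LOCATION"},
    "weather_d": {"INTERNET", "LOCATION", "SEND_SMS", "READ_CONTACTS"},
    "chat_a": {"INTERNET", "SEND_SMS", "READ_CONTACTS"},
    "chat_b": {"INTERNET", "SEND_SMS", "READ_CONTACTS"},
    "chat_c": {"INTERNET", "SEND_SMS"},
}

flagged = flag_api_outliers(topics, api_usage)
print(flagged)  # {'weather_d': ['READ_CONTACTS', 'SEND_SMS']}
```

In this toy data the same APIs are unremarkable in the chat cluster but stand out in the weather cluster, which is why the quality of the description clusters matters for the downstream outlier check.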

Problem

Research questions and friction points this paper is trying to address.

Detect Android malware using neural topic modeling

Improve semantic analysis of app descriptions

Enhance detection accuracy over traditional methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

BERTopic neural topic modeling for Android malware detection

Improved topic cluster coherence over traditional methods

~10% relative F1 score improvement compared to baselines
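Since the abstract reports the gain as a relative improvement, a short sketch may help distinguish it from an absolute gain in F1 points. The baseline and new F1 values below are hypothetical placeholders chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Hypothetical F1 values, not taken from the paper.
baseline_f1 = 0.70
new_f1 = 0.77

absolute_gain = new_f1 - baseline_f1
relative_gain = relative_improvement(new_f1, baseline_f1)

print(f"absolute gain: {absolute_gain:.2f} F1 points")
print(f"relative gain: {relative_gain:.0%}")
```

With these placeholder numbers, a 10% relative improvement corresponds to a gain of 0.07 in F1, not 0.10.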

🔎 Similar Papers

Reassessing feature-based Android malware detection in a contemporary context

2023-01-30 · Citations: 5
