Enhancing Android Malware Detection with Retrieval-Augmented Generation

📅 2025-06-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Android malware proliferation poses severe threats to user privacy and device security, so accurate detection methods are essential. This paper proposes a framework that combines static analysis with large language models (LLMs) enhanced by Retrieval-Augmented Generation (RAG). First, static features (permissions, components, and code structure) are extracted from APK files. Second, a RAG mechanism guides the LLM to generate context-aware functional summaries, which mitigates hallucination. Finally, a Transformer model jointly encodes the static features and the semantic summaries for classification. To our knowledge, this is the first work to tightly couple semantic understanding derived from a RAG-enhanced LLM with traditional static features, improving detection robustness. Evaluated on a custom dataset, the method achieves higher accuracy than state-of-the-art static approaches and performs particularly well on obfuscated malware samples.

📝 Abstract

The widespread use of Android applications has made them a prime target for cyberattacks, significantly increasing the risk of malware that threatens user privacy, security, and device functionality. Effective malware detection is therefore critical, and static analysis, dynamic analysis, and machine learning are the most widely used approaches. In this work, we focus on a machine learning-based method that uses static features. We first compiled a dataset of benign and malicious APKs and performed static analysis to extract features such as code structure, permissions, and manifest file content, without executing the apps. Instead of relying solely on raw static features, our system uses an LLM to generate high-level functional descriptions of APKs. To mitigate hallucinations, a known weakness of LLMs, we integrated Retrieval-Augmented Generation (RAG), which enables the LLM to ground its output in relevant context. Using carefully designed prompts, we guide the LLM to produce coherent function summaries, which are then analyzed by a transformer-based model, improving detection accuracy over conventional feature-based malware detection methods.
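The static-analysis step described in the abstract can be illustrated with a minimal sketch. It assumes the manifest has already been decoded from the APK's binary XML into plain text (the decoding tool is not named in the source), and the function name `extract_manifest_features` and the sample manifest are hypothetical. It is not the authors' extractor, only one plausible way to collect permissions and components without running the app.

```python
import xml.etree.ElementTree as ET

# Namespace used by android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical decoded manifest, used only for illustration.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <application>
    <activity android:name=".MainActivity"/>
    <service android:name=".SyncService"/>
  </application>
</manifest>"""


def extract_manifest_features(manifest_xml: str) -> dict:
    """Collect the package name, permissions, and components from manifest text."""
    root = ET.fromstring(manifest_xml)

    # Requested permissions, sorted so the output is deterministic.
    permissions = sorted(
        el.get(ANDROID_NS + "name")
        for el in root.iter("uses-permission")
        if el.get(ANDROID_NS + "name")
    )

    # The four standard component types declared under <application>.
    app = root.find("application")
    components = {}
    for tag in ("activity", "service", "receiver", "provider"):
        nodes = app.findall(tag) if app is not None else []
        components[tag] = [
            n.get(ANDROID_NS + "name") for n in nodes if n.get(ANDROID_NS + "name")
        ]

    return {
        "package": root.get("package"),
        "permissions": permissions,
        "components": components,
    }


if __name__ == "__main__":
    print(extract_manifest_features(SAMPLE_MANIFEST))
```

A real pipeline would also need code-structure features from the compiled bytecode, which this sketch leaves out.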

Problem

Research questions and friction points this paper is trying to address.

Detecting Android malware using static features and ML

Reducing LLM hallucinations with Retrieval-Augmented Generation

Improving detection accuracy via transformer-based model analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses static analysis for APK feature extraction

Integrates Retrieval-Augmented Generation with LLM

Employs transformer-based model for detection
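The retrieval-and-prompt part of the RAG step can be sketched as follows. The source does not specify the retriever, the knowledge base, or the prompt wording, so everything here is an assumption: a bag-of-words cosine similarity stands in for whatever retriever the authors used, and `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names. The LLM call itself is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical reference snippets; the paper's actual corpus is not described.
KNOWLEDGE_BASE = [
    "SEND_SMS allows an app to send text messages and may incur charges.",
    "INTERNET allows an app to open network sockets.",
    "CAMERA allows an app to capture photos and video.",
]


def tokenize(text: str) -> list:
    """Lowercase and split into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(permissions: list, docs: list, k: int = 2) -> str:
    """Assemble a grounded prompt from retrieved context and static features."""
    context = retrieve(" ".join(permissions), docs, k)
    lines = ["Context:"] + [f"- {c}" for c in context]
    lines += ["", "Permissions requested by the app:"]
    lines += [f"- {p}" for p in permissions]
    lines += ["", "Using only the context above, summarize what this app can do."]
    return "\n".join(lines)


if __name__ == "__main__":
    perms = ["android.permission.SEND_SMS", "android.permission.INTERNET"]
    print(build_prompt(perms, KNOWLEDGE_BASE))
```

Restricting the prompt to retrieved context is what grounds the summary: the model is asked to describe the app from supplied facts rather than from recall alone. The resulting summary would then be passed to the transformer-based classifier.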
