Enhancing Android Malware Detection with Retrieval-Augmented Generation

📅 2025-06-28
🤖 AI Summary
Android malware proliferation poses severe threats to user privacy and device security, so accurate detection methods are essential. This paper proposes a framework that integrates static analysis with large language models (LLMs) enhanced by Retrieval-Augmented Generation (RAG). First, static features (permissions, components, and code structure) are extracted from APK files. Second, a RAG mechanism grounds the LLM in retrieved context so that it generates precise, context-aware functional summaries, mitigating hallucination. Finally, a Transformer model jointly encodes the static features and the semantic summaries for classification. To our knowledge, this is the first work to deeply couple RAG-enhanced, LLM-derived semantic understanding with traditional static features, improving detection robustness. Evaluated on a custom dataset, our method achieves higher accuracy than state-of-the-art static approaches, particularly on obfuscated malware samples.

📝 Abstract
The widespread use of Android applications has made them a prime target for cyberattacks, significantly increasing the risk of malware that threatens user privacy, security, and device functionality. Effective malware detection is thus critical, with static analysis, dynamic analysis, and machine learning being widely used approaches. In this work, we focus on a machine learning-based method that uses static features. We first compiled a dataset of benign and malicious APKs and performed static analysis to extract features such as code structure, permissions, and manifest file content, without executing the apps. Instead of relying solely on raw static features, our system uses an LLM to generate high-level functional descriptions of APKs. To mitigate hallucinations, a known weakness of LLMs, we integrated Retrieval-Augmented Generation (RAG), enabling the LLM to ground its output in relevant context. Using carefully designed prompts, we guide the LLM to produce coherent functional summaries, which are then analyzed by a transformer-based model, improving detection accuracy over conventional feature-based methods.
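
The static-analysis step described in the abstract can be illustrated with a minimal sketch. It is not the authors' extraction tool. It assumes the manifest has already been decoded to plain-text XML (inside an APK, `AndroidManifest.xml` is stored in a binary XML format), and the function name `extract_manifest_features`, the sample manifest, and the output layout are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

# Attribute names in a decoded manifest live in the Android XML namespace.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def extract_manifest_features(manifest_xml: str) -> dict:
    """Collect the package name, requested permissions, and declared
    components from a decoded AndroidManifest.xml string (hypothetical helper)."""
    root = ET.fromstring(manifest_xml)
    permissions = sorted(
        el.get(NAME_ATTR)
        for el in root.iter("uses-permission")
        if el.get(NAME_ATTR)
    )
    components = {
        tag: sorted(el.get(NAME_ATTR) for el in root.iter(tag) if el.get(NAME_ATTR))
        for tag in COMPONENT_TAGS
    }
    return {
        "package": root.get("package"),
        "permissions": permissions,
        "components": components,
    }


# Made-up manifest used only to exercise the sketch.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <uses-permission android:name="android.permission.INTERNET"/>
    <application>
        <activity android:name=".MainActivity"/>
        <service android:name=".SyncService"/>
    </application>
</manifest>"""

if __name__ == "__main__":
    print(extract_manifest_features(SAMPLE_MANIFEST))
```

Nothing here executes the app, which is the defining property of static analysis. Code-structure features, which the abstract also mentions, would need a bytecode-level tool and are outside this sketch.
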
Problem

Research questions and friction points this paper is trying to address.

Detecting Android malware using static features and ML
Reducing LLM hallucinations with Retrieval-Augmented Generation
Improving detection accuracy via transformer-based model analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses static analysis for APK feature extraction
Integrates Retrieval-Augmented Generation with LLM
Employs transformer-based model for detection
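
The retrieval half of the RAG step listed above can be sketched as follows. The paper summary does not say which retriever, knowledge base, or prompt template the authors use, so everything here is an assumption: the snippets in `KNOWLEDGE_BASE` are invented, the helper names are hypothetical, and a simple bag-of-words cosine similarity stands in for whatever retriever the real system uses (RAG systems commonly use dense embeddings). The sketch stops at prompt construction and makes no LLM call.

```python
import math
import re
from collections import Counter

# Hypothetical reference snippets the LLM should be grounded in.
KNOWLEDGE_BASE = [
    "SEND_SMS lets an app send text messages and can be abused for premium-rate SMS fraud.",
    "INTERNET lets an app open network sockets to communicate with remote servers.",
    "READ_CONTACTS lets an app read the user's contact list.",
]


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(features: list, docs: list, k: int = 2) -> str:
    """Assemble a prompt that places retrieved context before the task."""
    context = retrieve(" ".join(features), docs, k)
    lines = ["Context:"]
    lines += [f"- {snippet}" for snippet in context]
    lines += [
        "",
        "Static features: " + ", ".join(features),
        "Task: summarize what this app can do, using only the context above.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(["SEND_SMS", "INTERNET"], KNOWLEDGE_BASE))
```

Putting the retrieved snippets ahead of the task and instructing the model to rely only on them is the grounding idea behind RAG: the summary is tied to supplied context rather than to the model's unaided recall.
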
Authors

Saraga S.
Department of Computer Applications, Cochin University of Science and Technology, Kochi, India
Anagha M. S.
Department of Computer Applications, Cochin University of Science and Technology, Kochi, India
Dincy R. Arikkat
Department of Computer Applications, Cochin University of Science and Technology, Kochi, India
Rafidha Rehiman K. A.
Department of Computer Applications, Cochin University of Science and Technology, Kochi, India
Serena Nicolazzo
Università del Piemonte Orientale
Security, Privacy, IoT, Cyber Threat Intelligence
Antonino Nocera
Associate Professor, University of Pavia
Artificial Intelligence, Security, Privacy, Data Science
Vinod P.
Department of Computer Applications, Cochin University of Science and Technology, Kochi, India