🤖 AI Summary
Android malware proliferation poses severe threats to user privacy and device security, necessitating highly accurate detection methods. This paper proposes a novel framework that integrates static analysis with large language models (LLMs) enhanced by Retrieval-Augmented Generation (RAG). First, static features, including permissions, components, and code structure, are extracted from APK files. Second, a RAG mechanism guides the LLM to generate precise, context-aware functional summaries, mitigating hallucination. Finally, a Transformer model jointly encodes the static features and the semantic summaries for classification. To the authors' knowledge, this is the first work to deeply couple semantic understanding derived from a RAG-enhanced LLM with traditional static features, improving detection robustness. Evaluated on a custom dataset, the method achieves higher accuracy than state-of-the-art static approaches and performs particularly well on obfuscated malware samples.
📝 Abstract
The widespread use of Android applications has made them a prime target for cyberattacks, significantly increasing the risk of malware that threatens user privacy, security, and device functionality. Effective malware detection is thus critical, with static analysis, dynamic analysis, and machine learning being widely used approaches. In this work, we focus on a machine learning-based method that uses static features. We first compiled a dataset of benign and malicious APKs and performed static analysis to extract features such as code structure, permissions, and manifest file content, without executing the apps. Instead of relying solely on raw static features, our system uses a large language model (LLM) to generate high-level functional descriptions of APKs. To mitigate hallucinations, a known weakness of LLMs, we integrated Retrieval-Augmented Generation (RAG), enabling the LLM to ground its output in relevant context. Using carefully designed prompts, we guide the LLM to produce coherent functional summaries, which are then analyzed by a transformer-based model, improving detection accuracy over conventional feature-based methods.
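
The first two stages described above (static feature extraction, then grounding the LLM prompt in retrieved context) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy knowledge base, the exact-match retrieval step, and the prompt wording are all assumptions made for illustration. It also assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format), and it stops before any LLM or classifier call.

```python
import xml.etree.ElementTree as ET

# Attribute namespace used by Android manifests.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical retrieval corpus: permission name -> short note.
# A real RAG system would likely use a larger corpus and embedding search.
KNOWLEDGE_BASE = {
    "android.permission.SEND_SMS":
        "Allows sending SMS messages; can be abused for premium-rate SMS fraud.",
    "android.permission.READ_CONTACTS":
        "Allows reading the user's contact list.",
}

# Hypothetical decoded manifest, used only to exercise the sketch.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <application>
    <activity android:name=".MainActivity"/>
    <service android:name=".SyncService"/>
  </application>
</manifest>
""".strip()


def extract_static_features(manifest_xml):
    """Collect package name, permissions, and components without running the app."""
    root = ET.fromstring(manifest_xml)
    permissions = [
        el.get(ANDROID_NS + "name")
        for el in root.iter("uses-permission")
        if el.get(ANDROID_NS + "name")
    ]
    components = [
        (tag, el.get(ANDROID_NS + "name"))
        for tag in ("activity", "service", "receiver", "provider")
        for el in root.iter(tag)
    ]
    return {
        "package": root.get("package"),
        "permissions": permissions,
        "components": components,
    }


def retrieve_context(features, knowledge_base):
    """Toy retrieval: look up notes for permissions present in the app."""
    return [
        f"{perm}: {knowledge_base[perm]}"
        for perm in features["permissions"]
        if perm in knowledge_base
    ]


def build_prompt(features, context):
    """Assemble a prompt that asks for a summary grounded in the retrieved notes."""
    lines = [
        "Summarize what this Android app can do, using only the facts below.",
        f"Package: {features['package']}",
        "Permissions: " + ", ".join(features["permissions"]),
        "Components: " + ", ".join(f"{t} {n}" for t, n in features["components"]),
        "Retrieved context:",
    ]
    lines.extend(f"- {note}" for note in context)
    return "\n".join(lines)


features = extract_static_features(SAMPLE_MANIFEST)
context = retrieve_context(features, KNOWLEDGE_BASE)
prompt = build_prompt(features, context)
```

In a complete pipeline, `prompt` would be sent to an LLM and the returned summary passed to a transformer-based classifier; both steps are left out here because the abstract does not specify the models or prompts used.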