🤖 AI Summary
Existing Android malware detection methods lack a fine-grained, interpretable understanding of adversary behavior at the Tactics, Techniques, and Procedures (TTP) level.
Method: We propose the first MITRE ATT&CK TTP mapping framework for Android applications. Our approach introduces the first Android application-level TTP-annotated dataset and integrates the Problem Transformation Approach (PTA), retrieval-augmented generation (RAG), prompt engineering, and Llama fine-tuning for multi-label TTP attribution. We further employ SHAP to enhance model interpretability.
Contribution/Results: This work achieves the first fine-grained, explainable mapping from Android app behaviors to ATT&CK tactic and technique layers. Experimental results show that the Label Powerset XGBoost model attains a Jaccard score of 0.9893 and a Hamming loss of 0.0054 on tactic classification; fine-tuned Llama achieves 0.9583 and 0.0182, respectively, demonstrating the viability of large language models in mobile threat modeling. The framework enables precise threat intelligence analysis and informs defensive response strategies.
📝 Abstract
The widespread adoption of Android devices for sensitive operations like banking and communication has made them prime targets for cyber threats, particularly Advanced Persistent Threats (APTs) and sophisticated malware attacks. Traditional malware detection methods rely on binary classification, failing to provide insights into adversarial Tactics, Techniques, and Procedures (TTPs). Understanding malware behavior is crucial for enhancing cybersecurity defenses. To address this gap, we introduce DroidTTP, a framework mapping Android malware behaviors to TTPs based on the MITRE ATT&CK framework. Our curated dataset explicitly links MITRE TTPs to Android applications. We developed an automated solution leveraging the Problem Transformation Approach (PTA) and Large Language Models (LLMs) to map applications to both Tactics and Techniques. Additionally, we employed Retrieval-Augmented Generation (RAG) with prompt engineering and LLM fine-tuning for TTP predictions. Our structured pipeline includes dataset creation, hyperparameter tuning, data augmentation, feature selection, model development, and SHAP-based model interpretability. Among LLMs, Llama achieved the highest performance in Tactic classification with a Jaccard Similarity of 0.9583 and Hamming Loss of 0.0182, and in Technique classification with a Jaccard Similarity of 0.9348 and Hamming Loss of 0.0127. However, the Label Powerset XGBoost model outperformed LLMs, achieving a Jaccard Similarity of 0.9893 for Tactic classification and 0.9753 for Technique classification, with Hamming Losses of 0.0054 and 0.0050, respectively. While XGBoost showed superior performance, the narrow margin highlights the potential of LLM-based approaches in TTP classification.
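To make the Label Powerset transformation and the two reported metrics concrete, here is a minimal, hedged sketch of the evaluation setup. The features and tactic labels below are synthetic placeholders (not the paper's dataset), and a scikit-learn `RandomForestClassifier` stands in for XGBoost to keep the example dependency-free; only the Label Powerset idea and the Jaccard similarity / Hamming loss metrics match the text.

```python
# Sketch of Label Powerset multi-label classification with the metrics
# used in the paper (Jaccard similarity, Hamming loss). Data is synthetic;
# RandomForest is a stand-in for XGBoost (assumption for illustration).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import jaccard_score, hamming_loss

rng = np.random.default_rng(0)
X = rng.random((200, 10))           # placeholder app feature vectors
Y = (X[:, :4] > 0.5).astype(int)    # 4 hypothetical tactic labels per app

# Label Powerset: treat each distinct label combination as a single class,
# reducing multi-label classification to ordinary multi-class classification.
combos, y_lp = np.unique(Y, axis=0, return_inverse=True)

clf = RandomForestClassifier(random_state=0).fit(X[:150], y_lp[:150])
Y_pred = combos[clf.predict(X[150:])]   # map predicted classes back to label sets

print("Jaccard similarity:",
      jaccard_score(Y[150:], Y_pred, average="samples", zero_division=1))
print("Hamming loss:", hamming_loss(Y[150:], Y_pred))
```

Per-sample Jaccard similarity averages the overlap between predicted and true label sets, while Hamming loss counts the fraction of individual label bits that disagree; the paper reports both because a model can score well on one and poorly on the other.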