🤖 AI Summary
Large language models (LLMs) face significant challenges in analyzing decompiled Android code for malware, particularly due to the sheer volume of functions and the absence or obfuscation of meaningful function names.
Method: We propose Cama, the first benchmarking framework tailored to this task. It introduces a structured output specification (function summary, semantic renaming, and maliciousness score) and three domain-specific evaluation metrics (consistency, fidelity, and semantic relevance), integrating structured prompt engineering with multi-dimensional automated assessment.
Contribution/Results: Evaluated on a large-scale benchmark of 118 real-world malware samples comprising over 7.5 million functions, Cama assesses four open-source code-specialized LLMs (e.g., CodeLlama, StarCoder). The results reveal high sensitivity to missing function names and fundamental limitations in reasoning about malicious logic. Cama establishes a new evaluation paradigm and provides an empirical foundation for security-oriented assessment of code LLMs.
📝 Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in various code intelligence tasks. However, their effectiveness for Android malware analysis remains underexplored. Decompiled Android code poses unique challenges for analysis, primarily due to its large volume of functions and the frequent absence of meaningful function names. This paper presents Cama, a benchmarking framework designed to systematically evaluate the effectiveness of Code LLMs in Android malware analysis tasks. Cama specifies structured model outputs (comprising function summaries, refined function names, and maliciousness scores) to support key malware analysis tasks, including malicious function identification and malware purpose summarization. Built on these outputs, it integrates three domain-specific evaluation metrics (consistency, fidelity, and semantic relevance), enabling rigorous assessment of stability and effectiveness as well as cross-model comparison. We construct a benchmark dataset consisting of 118 Android malware samples, encompassing over 7.5 million distinct functions, and use Cama to evaluate four popular open-source models. Our experiments provide insights into how Code LLMs interpret decompiled code and quantify their sensitivity to function renaming, highlighting both the potential and the current limitations of Code LLMs in malware analysis tasks.
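To make the structured output specification concrete, the sketch below shows a minimal, hypothetical per-function record (a summary, a refined name, and a maliciousness score) and how such records could feed malicious function identification. The field names, score range, and threshold are illustrative assumptions, not Cama's exact schema.

```python
from dataclasses import dataclass

@dataclass
class FunctionReport:
    """Hypothetical per-function output; fields are assumptions, not Cama's exact schema."""
    summary: str          # natural-language summary of the decompiled function
    refined_name: str     # semantically meaningful replacement for the stripped name
    maliciousness: float  # assumed score in [0, 1]; higher means more likely malicious

def flag_malicious(reports, threshold=0.5):
    """Keep functions scoring at or above the threshold, most suspicious first."""
    flagged = [r for r in reports if r.maliciousness >= threshold]
    return sorted(flagged, key=lambda r: r.maliciousness, reverse=True)

# Toy example: two functions from a decompiled sample (contents invented).
reports = [
    FunctionReport("Reads the SMS inbox and uploads it to a remote host", "exfiltrateSms", 0.92),
    FunctionReport("Formats a timestamp string for the UI", "formatTimestamp", 0.05),
]
print([r.refined_name for r in flag_malicious(reports)])
```

A downstream malware purpose summarization step could then aggregate the summaries of the flagged functions rather than processing all 7.5 million functions at once.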