🤖 AI Summary
Large language models (LLMs) face significant challenges in analyzing decompiled Android code for malware, particularly due to the sheer volume of functions and the absence or obfuscation of meaningful function names.
Method: We propose Cama, the first benchmarking framework tailored to this task. It introduces a structured output specification (function summary, semantic renaming, and maliciousness score) and three domain-specific evaluation metrics (consistency, fidelity, and semantic relevance), integrating structured prompt engineering with multi-dimensional automated assessment.
Contribution/Results: Evaluated on a large-scale benchmark of 118 real-world malware samples comprising over 7.5 million functions, Cama assesses four open-source code-specialized LLMs (e.g., CodeLlama, StarCoder). The results reveal high sensitivity to missing function names and fundamental limitations in reasoning about malicious logic. Cama establishes a new evaluation paradigm and provides an empirical foundation for security-oriented assessment of code LLMs.
📝 Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in various code intelligence tasks. However, their effectiveness for Android malware analysis remains underexplored. Decompiled Android code poses unique challenges for analysis, primarily due to its large volume of functions and the frequent absence of meaningful function names. This paper presents Cama, a benchmarking framework designed to systematically evaluate the effectiveness of Code LLMs in Android malware analysis tasks. Cama specifies structured model outputs (comprising function summaries, refined function names, and maliciousness scores) to support key malware analysis tasks, including malicious function identification and malware purpose summarization. Built on these outputs, it integrates three domain-specific evaluation metrics (consistency, fidelity, and semantic relevance), enabling rigorous assessment of stability and effectiveness as well as cross-model comparison. We construct a benchmark dataset consisting of 118 Android malware samples, encompassing over 7.5 million distinct functions, and use Cama to evaluate four popular open-source models. Our experiments provide insights into how Code LLMs interpret decompiled code and quantify their sensitivity to function renaming, highlighting both the potential and the current limitations of Code LLMs in malware analysis tasks.
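To make the structured output specification concrete, the sketch below shows a minimal, hypothetical per-function record (a summary, a refined name, and a maliciousness score) and how such records could feed malicious function identification. The field names, score range, and threshold are illustrative assumptions, not Cama's exact schema.

```python
from dataclasses import dataclass

@dataclass
class FunctionReport:
    """Hypothetical per-function output; fields are assumptions, not Cama's exact schema."""
    summary: str          # natural-language summary of the decompiled function
    refined_name: str     # semantically meaningful replacement for the stripped name
    maliciousness: float  # assumed score in [0, 1]; higher means more likely malicious

def flag_malicious(reports, threshold=0.5):
    """Keep functions scoring at or above the threshold, most suspicious first."""
    flagged = [r for r in reports if r.maliciousness >= threshold]
    return sorted(flagged, key=lambda r: r.maliciousness, reverse=True)

# Toy example: two functions from a decompiled sample (contents invented).
reports = [
    FunctionReport("Reads the SMS inbox and uploads it to a remote host", "exfiltrateSms", 0.92),
    FunctionReport("Formats a timestamp string for the UI", "formatTimestamp", 0.05),
]
print([r.refined_name for r in flag_malicious(reports)])
```

A downstream malware purpose summarization step could then aggregate the summaries of the flagged functions rather than processing all 7.5 million functions at once.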