🤖 AI Summary
Malware behavior auditing faces three core challenges: (1) obfuscated malicious intent within complex applications, (2) distorted evaluation due to scarce fine-grained labeled data, and (3) unverifiable large language model (LLM) outputs prone to hallucination. To address these, we propose MalEval, the first systematic framework for evaluating LLMs' fine-grained Android malware behavior auditing capabilities under realistic constraints. We introduce four analyst-oriented tasks, design domain-specific metrics and a unified workload scoring scheme, and construct verifiable intermediate attribution units by integrating expert reports, updated sensitive API lists, static reachability analysis, and function-level structured representations. Evaluating seven state-of-the-art LLMs on recent malware and benign false-positive samples, we characterize their capabilities and limitations across attribution, explanation, and verification stages. MalEval establishes the first reproducible, verifiable, fine-grained auditing benchmark for Android malware analysis.
📄 Abstract
Automated malware classification has achieved strong detection performance. Yet malware behavior auditing seeks causal and verifiable explanations of malicious activities -- essential not only to reveal what malware does but also to substantiate such claims with evidence. This task is challenging, as adversarial intent is often hidden within complex, framework-heavy applications, making manual auditing slow and costly. Large Language Models (LLMs) could help address this gap, but their auditing potential remains largely unexplored due to three limitations: (1) scarce fine-grained annotations for fair assessment; (2) abundant benign code obscuring malicious signals; and (3) unverifiable, hallucination-prone outputs undermining attribution credibility. To close this gap, we introduce MalEval, a comprehensive framework for fine-grained Android malware auditing, designed to evaluate how effectively LLMs support auditing under real-world constraints. MalEval provides expert-verified reports and an updated sensitive API list to mitigate ground-truth scarcity, and reduces noise via static reachability analysis. Function-level structural representations serve as intermediate attribution units for verifiable evaluation. Building on this, we define four analyst-aligned tasks -- function prioritization, evidence attribution, behavior synthesis, and sample discrimination -- together with domain-specific metrics and a unified workload-oriented score. We evaluate seven widely used LLMs on a curated dataset of recent malware and misclassified benign apps, offering the first systematic assessment of their auditing capabilities. MalEval reveals both promising potential and critical limitations across audit stages, providing a reproducible benchmark and foundation for future research on LLM-enhanced malware behavior auditing. MalEval is publicly available at https://github.com/ZhengXR930/MalEval.git
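To make the noise-reduction idea concrete, the following is a minimal, hypothetical sketch of reachability-based filtering in the spirit described above: given a static call graph, only functions whose call chains can transitively reach a sensitive API are kept as candidate attribution units, and the benign remainder is discarded. All function names, the toy call graph, and the API list here are invented for illustration and are not taken from MalEval itself.

```python
# Hypothetical sketch: prune a static call graph down to functions that
# can transitively reach a sensitive API. Names and data are invented.

SENSITIVE_APIS = {
    "SmsManager.sendTextMessage",
    "TelephonyManager.getDeviceId",
}

# Toy call graph: caller -> list of callees.
CALL_GRAPH = {
    "MainActivity.onCreate": ["Billing.start", "Ui.render"],
    "Billing.start": ["SmsManager.sendTextMessage"],
    "Ui.render": [],
}

def reachable_sensitive(fn, graph, sensitive, seen=None):
    """Return True if `fn` can transitively reach a sensitive API."""
    if seen is None:
        seen = set()
    if fn in sensitive:
        return True
    if fn in seen:          # already explored; avoid cycles
        return False
    seen.add(fn)
    return any(
        reachable_sensitive(callee, graph, sensitive, seen)
        for callee in graph.get(fn, [])
    )

# Keep only functions that matter for auditing.
candidates = [
    f for f in CALL_GRAPH
    if reachable_sensitive(f, CALL_GRAPH, SENSITIVE_APIS)
]
print(candidates)  # MainActivity.onCreate and Billing.start survive
```

In this toy example, `Ui.render` is dropped because none of its call chains touch a sensitive API, which mirrors how reachability analysis can shrink the amount of benign code an LLM (or analyst) has to inspect.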