🤖 AI Summary
Malware behavior auditing faces three core challenges: (1) obfuscated malicious intent within complex applications, (2) distorted evaluation due to scarce fine-grained labeled data, and (3) unverifiable large language model (LLM) outputs prone to hallucination. To address these, we propose MalEval, the first systematic framework for evaluating LLMs' fine-grained Android malware behavior auditing capabilities under realistic constraints. We introduce four analyst-oriented tasks, design domain-specific metrics and a unified workload scoring scheme, and construct verifiable intermediate attribution units by integrating expert reports, updated sensitive API lists, static reachability analysis, and function-level structured representations. Evaluating seven state-of-the-art LLMs on recent malware and benign false-positive samples, we characterize their capabilities and limitations across attribution, explanation, and verification stages. MalEval establishes the first reproducible, verifiable, fine-grained auditing benchmark for Android malware analysis.
📄 Abstract
Automated malware classification has achieved strong detection performance. Yet malware behavior auditing seeks causal and verifiable explanations of malicious activities -- essential not only to reveal what malware does but also to substantiate such claims with evidence. This task is challenging, as adversarial intent is often hidden within complex, framework-heavy applications, making manual auditing slow and costly. Large Language Models (LLMs) could help address this gap, but their auditing potential remains largely unexplored due to three limitations: (1) scarce fine-grained annotations for fair assessment; (2) abundant benign code obscuring malicious signals; and (3) unverifiable, hallucination-prone outputs undermining attribution credibility. To close this gap, we introduce MalEval, a comprehensive framework for fine-grained Android malware auditing, designed to evaluate how effectively LLMs support auditing under real-world constraints. MalEval provides expert-verified reports and an updated sensitive API list to mitigate ground-truth scarcity, and reduces noise via static reachability analysis. Function-level structural representations serve as intermediate attribution units for verifiable evaluation. Building on this, we define four analyst-aligned tasks -- function prioritization, evidence attribution, behavior synthesis, and sample discrimination -- together with domain-specific metrics and a unified workload-oriented score. We evaluate seven widely used LLMs on a curated dataset of recent malware and misclassified benign apps, offering the first systematic assessment of their auditing capabilities. MalEval reveals both promising potential and critical limitations across audit stages, providing a reproducible benchmark and foundation for future research on LLM-enhanced malware behavior auditing. MalEval is publicly available at https://github.com/ZhengXR930/MalEval.git
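To make the noise-reduction idea concrete, the following is a minimal, hypothetical sketch of reachability-based filtering in the spirit described above: given a static call graph, only functions whose call chains can transitively reach a sensitive API are kept as candidate attribution units, and the benign remainder is discarded. All function names, the toy call graph, and the API list here are invented for illustration and are not taken from MalEval itself.

```python
# Hypothetical sketch: prune a static call graph down to functions that
# can transitively reach a sensitive API. Names and data are invented.

SENSITIVE_APIS = {
    "SmsManager.sendTextMessage",
    "TelephonyManager.getDeviceId",
}

# Toy call graph: caller -> list of callees.
CALL_GRAPH = {
    "MainActivity.onCreate": ["Billing.start", "Ui.render"],
    "Billing.start": ["SmsManager.sendTextMessage"],
    "Ui.render": [],
}

def reachable_sensitive(fn, graph, sensitive, seen=None):
    """Return True if `fn` can transitively reach a sensitive API."""
    if seen is None:
        seen = set()
    if fn in sensitive:
        return True
    if fn in seen:          # already explored; avoid cycles
        return False
    seen.add(fn)
    return any(
        reachable_sensitive(callee, graph, sensitive, seen)
        for callee in graph.get(fn, [])
    )

# Keep only functions that matter for auditing.
candidates = [
    f for f in CALL_GRAPH
    if reachable_sensitive(f, CALL_GRAPH, SENSITIVE_APIS)
]
print(candidates)  # MainActivity.onCreate and Billing.start survive
```

In this toy example, `Ui.render` is dropped because none of its call chains touch a sensitive API, which mirrors how reachability analysis can shrink the amount of benign code an LLM (or analyst) has to inspect.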