🤖 AI Summary
This work addresses critical challenges in applying large language models (LLMs) to WebShell detection, namely precision-recall imbalance, contextual redundancy, and weak discriminative capability, by proposing BFAD, the first behavior-aware LLM adaptation framework for this task. BFAD comprises three core components: (1) a Critical Function Filter based on PHP syntax analysis; (2) Context-Aware Code Extraction with dynamic context truncation; and (3) Weighted Behavioral Function Profiling (WBFP), which ranks in-context demonstrations using discriminative function-level behavioral profiles. Evaluated on a benchmark of 26,590 PHP samples, BFAD improves multiple LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, under both instruction tuning and in-context learning, yielding an average F1-score improvement of 13.82%. Notably, larger models such as GPT-4 surpass traditional state-of-the-art methods, while the smaller Qwen 2.5 3B achieves performance competitive with them, providing the first systematic validation of the feasibility of LLMs for WebShell detection and of how to optimize them for this task.
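To make the filtering and extraction steps concrete, here is a minimal Python sketch of the idea: scan a PHP script for calls to behavior-critical functions and keep a small window of surrounding lines as the behaviorally indicative context. The function list, the regex-based matching, and the two-line window are illustrative assumptions, not the paper's actual implementation, which relies on PHP syntax analysis and dynamic context truncation.

```python
import re

# Hypothetical list of PHP functions commonly abused by WebShells;
# the paper's actual critical-function set is not reproduced here.
CRITICAL_FUNCTIONS = [
    "eval", "assert", "system", "exec", "shell_exec", "passthru",
    "popen", "proc_open", "base64_decode", "create_function",
]

def extract_critical_contexts(php_source: str, window: int = 2) -> list[str]:
    """Return snippets centered on calls to critical functions.

    `window` lines of context are kept on each side of a hit; the
    window size is an illustrative choice, not the paper's setting.
    """
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, CRITICAL_FUNCTIONS)) + r")\s*\(")
    lines = php_source.splitlines()
    snippets = []
    for i, line in enumerate(lines):
        if pattern.search(line):
            lo, hi = max(0, i - window), min(len(lines), i + window + 1)
            snippets.append("\n".join(lines[lo:hi]))
    return snippets

sample = """<?php
$cmd = $_GET['c'];
system($cmd);
echo 'done';
"""
print(extract_critical_contexts(sample))
```

On the sample script above, the only hit is the `system($cmd);` call, so a single snippet containing that line and its neighbors is returned; benign code without critical calls yields no snippets at all, which is what lets this step shrink the input handed to the LLM.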
📝 Abstract
WebShell attacks, in which malicious scripts are injected into web servers, are a major cybersecurity threat. Traditional machine learning and deep learning methods are hampered by the need for extensive training data, catastrophic forgetting, and poor generalization. Recently, Large Language Models (LLMs) have gained attention for code-related tasks, but their potential in WebShell detection remains underexplored. In this paper, we make two major contributions: (1) a comprehensive evaluation of seven LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional sequence- and graph-based methods on a dataset of 26.59K PHP scripts, and (2) the Behavioral Function-Aware Detection (BFAD) framework, designed to address the specific challenges of applying LLMs to this domain. Our framework integrates three components: a Critical Function Filter that isolates malicious PHP function calls, a Context-Aware Code Extraction strategy that captures the most behaviorally indicative code segments, and Weighted Behavioral Function Profiling (WBFP), which enhances in-context learning by prioritizing the most relevant demonstrations based on discriminative function-level profiles. Our results show that larger LLMs achieve near-perfect precision but lower recall, while smaller models exhibit the opposite trade-off; all models, however, lag behind previous state-of-the-art (SOTA) methods. With BFAD, every LLM improves, with an average F1-score increase of 13.82%. Larger models such as GPT-4, LLaMA 3.1 70B, and Qwen 2.5 14B outperform SOTA methods, while smaller models such as Qwen 2.5 3B achieve performance competitive with traditional approaches. This work is the first to explore the feasibility and limitations of LLMs for WebShell detection, and it provides solutions to the challenges this task poses.
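The WBFP idea of prioritizing demonstrations can be illustrated with a small sketch: represent each script as a profile of critical-function call counts, compare profiles with a function-weighted similarity, and select the most similar labeled scripts as in-context examples. The weight values, the function set, and the cosine-style similarity below are hypothetical stand-ins; the paper derives its weighting from discriminative function-level behavioral profiles rather than the hand-picked numbers shown here.

```python
import math
import re
from collections import Counter

# Illustrative per-function weights standing in for WBFP's learned
# discriminative profiles (assumed values, not the paper's).
FUNCTION_WEIGHTS = {
    "eval": 3.0, "system": 2.5, "shell_exec": 2.5,
    "base64_decode": 1.5, "preg_replace": 1.0, "echo": 0.1,
}

def profile(php_source: str) -> Counter:
    """Count calls to profiled functions in a script."""
    calls = re.findall(r"\b(\w+)\s*\(", php_source)
    return Counter(c for c in calls if c in FUNCTION_WEIGHTS)

def weighted_similarity(p: Counter, q: Counter) -> float:
    """Weighted cosine similarity between two function profiles."""
    num = sum(FUNCTION_WEIGHTS[f] * p[f] * q[f] for f in FUNCTION_WEIGHTS)
    norm = lambda v: math.sqrt(
        sum(FUNCTION_WEIGHTS[f] * v[f] ** 2 for f in FUNCTION_WEIGHTS))
    return num / (norm(p) * norm(q)) if norm(p) and norm(q) else 0.0

def select_demonstrations(query: str, pool: list[str], k: int = 2) -> list[str]:
    """Rank candidate demonstrations by profile similarity to the query."""
    qp = profile(query)
    return sorted(pool, key=lambda d: weighted_similarity(profile(d), qp),
                  reverse=True)[:k]
```

For a query script built around `eval`, a pool entry that also calls `eval` (plus `base64_decode`) outranks a benign script that only calls `echo`, so the demonstrations handed to the LLM share the query's behavioral signature rather than merely its surface text.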