PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional penetration testing is labor-intensive and scales poorly; existing LLM-based approaches lack systematic task decomposition and domain-specific adaptation, resulting in unreliable black-box behavior. Method: We propose PentestEval, the first end-to-end LLM benchmark for penetration testing. It decomposes the workflow into six stages—reconnaissance, vulnerability filtering, attack planning, exploit generation, post-exploitation, and reporting—and curates an expert-annotated dataset of 346 tasks across 12 real-world vulnerable environments, coupled with a fully automated evaluation pipeline. Contribution/Results: We introduce a modular, stage-wise evaluation paradigm with multi-dimensional metrics (functional correctness, safety, executability) and a dual-track (end-to-end + stage-level) assessment protocol. Experiments show that nine mainstream LLMs achieve only a 31% end-to-end success rate, and autonomous agents fail in over 90% of stages. Modular design significantly improves both per-stage performance and overall robustness, validating structured reasoning as critical for autonomous penetration testing.

📝 Abstract
Penetration testing is essential for assessing and strengthening system security against real-world threats, yet traditional workflows remain highly manual, expertise-intensive, and difficult to scale. Although recent advances in Large Language Models (LLMs) offer promising opportunities for automation, existing applications rely on simplistic prompting without task decomposition or domain adaptation, resulting in unreliable black-box behavior and limited insight into model capabilities across penetration testing stages. To address this gap, we introduce PentestEval, the first comprehensive benchmark for evaluating LLMs across six decomposed penetration testing stages, including Information Collection, Weakness Gathering and Filtering, Attack Decision-Making, and Exploit Generation and Revision. PentestEval integrates expert-annotated ground truth with a fully automated evaluation pipeline across 346 tasks covering all stages in 12 realistic vulnerable scenarios. Our stage-level evaluation of 9 widely used LLMs reveals generally weak performance and distinct limitations across the stages of the penetration-testing workflow. End-to-end pipelines reach only a 31% success rate, and existing LLM-powered systems such as PentestGPT, PentestAgent, and VulnBot exhibit similar limitations, with autonomous agents failing almost entirely. These findings highlight that autonomous penetration testing demands stronger structured reasoning, where modularization enhances each individual stage and improves overall performance. PentestEval provides the foundational benchmark needed for future research on fine-grained, stage-level evaluation, paving the way toward more reliable LLM-based automation.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs in decomposed penetration testing stages
Benchmarks automated evaluation across realistic vulnerable scenarios
Identifies limitations in current LLM-based penetration testing systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes penetration testing into six modular stages
Integrates expert-annotated ground truth with automated evaluation
Highlights need for structured reasoning over simplistic prompting
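The dual-track protocol described above (scoring each stage independently versus requiring the full pipeline to succeed) can be sketched in a few lines. This is a minimal illustrative sketch, not PentestEval's actual API: the stage names follow the AI summary, while `solver`, `ground_truth`, and the exact-match scoring are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Callable

# Six workflow stages, as named in the AI summary above.
STAGES = [
    "reconnaissance",
    "vulnerability_filtering",
    "attack_planning",
    "exploit_generation",
    "post_exploitation",
    "reporting",
]

@dataclass
class StageResult:
    stage: str
    passed: bool

def evaluate_stagewise(solver: Callable[[str, str], str],
                       ground_truth: dict[str, str],
                       task_input: str) -> list[StageResult]:
    """Stage-level track: each stage is scored in isolation against
    expert-annotated ground truth, so one failure cannot cascade."""
    return [StageResult(s, solver(s, task_input) == ground_truth[s])
            for s in STAGES]

def evaluate_end_to_end(solver: Callable[[str, str], str],
                        ground_truth: dict[str, str],
                        task_input: str) -> bool:
    """End-to-end track: each stage consumes the previous stage's
    output, and the run succeeds only if every stage matches."""
    out = task_input
    for s in STAGES:
        out = solver(s, out)
        if out != ground_truth[s]:
            return False
    return True
```

The contrast between the two tracks is the point: a model can pass most stages in isolation yet fail end-to-end, which is consistent with the paper's finding that modularization improves per-stage performance while full pipelines reach only a 31% success rate.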
Ruozhao Yang
School of Computing and Information Systems, Singapore Management University, 188065, Singapore
Mingfei Cheng
Singapore Management University
Software Engineering, Software Testing, Autonomous Driving, AI Systems
Gelei Deng
Nanyang Technological University
Cybersecurity, System Security, Robotics Security, AI Security, Software Testing
Tianwei Zhang
School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798
Junjie Wang
School of Cyber Security, Tianjin University, Tianjin 300350, China
Xiaofei Xie
Singapore Management University
Software Engineering, Loop Analysis, Testing, Deep Learning