Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

📅 2024-10-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
A publicly available, end-to-end benchmark for systematically evaluating large language models’ (LLMs) real-world capabilities in automated penetration testing is currently lacking. Method: We introduce PentestBench—the first open-source LLM penetration testing benchmark—covering the full lifecycle: reconnaissance, vulnerability exploitation, and privilege escalation. Built upon the PentestGPT framework, it enables multi-model comparison, chained-task evaluation, and ablation analysis. Contribution/Results: Experiments reveal that Llama 3.1-405B marginally outperforms GPT-4o, yet no model achieves fully autonomous, end-to-end penetration testing. Critical failure points are precisely localized across stages. This work establishes the first LLM evaluation paradigm explicitly aligned with the penetration testing lifecycle, uncovering fundamental capability bottlenecks and proposing actionable, implementation-ready improvements—thereby filling a critical gap in AI-driven security automation assessment.

📝 Abstract
Hacking poses a significant threat to cybersecurity, inflicting billions of dollars in damages annually. To mitigate these risks, ethical hacking, or penetration testing, is employed to identify vulnerabilities in systems and networks. Recent advancements in large language models (LLMs) have shown potential across various domains, including cybersecurity. However, there is currently no comprehensive, open, end-to-end automated penetration testing benchmark to drive progress and evaluate the capabilities of these models in security contexts. This paper introduces a novel open benchmark for LLM-based automated penetration testing, addressing this critical gap. We first evaluate the performance of LLMs, including GPT-4o and Llama 3.1-405B, using the state-of-the-art PentestGPT tool. Our findings reveal that while Llama 3.1 demonstrates an edge over GPT-4o, both models currently fall short of performing fully automated, end-to-end penetration testing. Next, we advance the state of the art and present ablation studies that provide insights into improving the PentestGPT tool. Our research illuminates the challenges LLMs face in each aspect of penetration testing, e.g., enumeration, exploitation, and privilege escalation. This work contributes to the growing body of knowledge on AI-assisted cybersecurity and lays the foundation for future research in automated penetration testing using large language models.
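The stage-wise evaluation the abstract describes (scoring a model separately on enumeration, exploitation, and privilege escalation, and requiring all stages to succeed for an end-to-end pass) can be sketched as a small scoring harness. This is a hypothetical illustration, not the paper's actual benchmark code; the stage names follow the abstract, while `TaskResult` and the scoring functions are assumptions for the sketch.

```python
# Hypothetical sketch of stage-wise benchmark scoring in the spirit of the
# paper's lifecycle evaluation. Task names and scores are illustrative.
from dataclasses import dataclass

STAGES = ("enumeration", "exploitation", "privilege_escalation")

@dataclass
class TaskResult:
    stage: str       # which lifecycle stage this sub-task belongs to
    completed: bool  # did the LLM-driven run complete the sub-task?

def stage_completion_rates(results: list[TaskResult]) -> dict[str, float]:
    """Per-stage fraction of completed sub-tasks; localizes failure points."""
    rates = {}
    for stage in STAGES:
        stage_results = [r for r in results if r.stage == stage]
        done = sum(r.completed for r in stage_results)
        rates[stage] = done / len(stage_results) if stage_results else 0.0
    return rates

def end_to_end_success(results: list[TaskResult]) -> bool:
    """Chained-task view: a run succeeds only if every stage fully completes."""
    rates = stage_completion_rates(results)
    return all(rates[s] == 1.0 for s in STAGES)

# Illustrative run: the model clears the early stages but stalls at
# privilege escalation, so the end-to-end run fails.
demo = [
    TaskResult("enumeration", True),
    TaskResult("exploitation", True),
    TaskResult("privilege_escalation", False),
]
print(stage_completion_rates(demo))
print(end_to_end_success(demo))  # False: not fully autonomous end-to-end
```

Separating per-stage rates from the chained end-to-end criterion mirrors the paper's finding that models can do well on individual stages while still failing the full lifecycle.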
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Cybersecurity Vulnerability Detection
Comprehensive Public Testing Standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

PentestGPT
Large Language Models
Automated Penetration Testing
Isamu Isozaki
Drexel University
Manil Shrestha
Drexel University
Machine Learning · Generative AI
Rick Console
Independent
Edward Kim
Drexel University