Automated Penetration Testing with LLM Agents and Classical Planning

📅 2025-12-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current LLM-based agents face three bottlenecks in fully automated penetration testing: they lose coherence in long-horizon plans, they struggle with complex reasoning, and they use domain-specific security tools inefficiently. To address these challenges, the paper proposes the Planner-Executor-Perceptor paradigm and introduces CHECKMATE, a framework that places an external, enhanced classical planner (PDDL-based) at the center of a multi-agent perception, planning, and execution loop as a structured "cognitive core", alongside state-of-the-art LLM agents (e.g., Claude Code and Sonnet 4.5). The design targets the stability and interpretability limitations of end-to-end, LLM-only approaches in security-critical tasks. In the paper's evaluation, CHECKMATE improves penetration success rates by over 20% and cuts execution time and monetary cost by more than 50% compared with the prior state-of-the-art system.

📝 Abstract

While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands-off-the-keyboard execution remains a significant research challenge. In this paper, we introduce the "Planner-Executor-Perceptor (PEP)" design paradigm and use it to systematically review existing work and identify the key challenges in this area. We also evaluate existing penetration testing systems, with a particular focus on the use of Large Language Model (LLM) agents for this task. The results show that the out-of-the-box Claude Code and Sonnet 4.5 exhibit the strongest penetration capabilities observed to date, substantially outperforming all prior systems. However, a detailed analysis of their testing processes reveals specific strengths and limitations; notably, LLM agents struggle to maintain coherent long-horizon plans, perform complex reasoning, and use specialized tools effectively. These limitations significantly constrain their overall capability, efficiency, and stability. To address these limitations, we propose CHECKMATE, a framework that integrates enhanced classical planning with LLM agents, providing an external, structured "brain" that mitigates the inherent weaknesses of LLM agents. Our evaluation shows that CHECKMATE outperforms the state-of-the-art system (Claude Code) in penetration capability, improving benchmark success rates by over 20%. In addition, it delivers substantially greater stability, cutting both time and monetary costs by more than 50%.
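
To make the Planner-Executor-Perceptor idea concrete, here is a minimal Python sketch of such a loop. It is an illustration under stated assumptions, not CHECKMATE's implementation: the planner is a toy breadth-first search over add-only, STRIPS-style facts rather than the paper's enhanced classical planner, and every name in it (`Action`, `plan`, `run_pep_loop`, the three example actions, and the stub executor and perceptor) is hypothetical. In a real system the executor and perceptor would presumably be LLM agents that run security tools and turn raw output into facts.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action with add-only effects."""
    name: str
    preconditions: frozenset
    effects: frozenset


def plan(state, goal, actions):
    """Toy planner: breadth-first search over sets of facts.

    Returns a list of action names that reaches the goal, an empty list
    if the goal already holds, or None if no plan exists.
    """
    start = frozenset(state)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        facts, path = queue.popleft()
        if goal <= facts:
            return path
        for action in actions:
            if action.preconditions <= facts:
                nxt = facts | action.effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [action.name]))
    return None


def run_pep_loop(initial, goal, actions, execute, perceive, max_steps=10):
    """Replan, execute one step, and perceive new facts until done."""
    state = set(initial)
    trace = []
    for _ in range(max_steps):
        steps = plan(state, goal, actions)
        if steps is None:   # goal unreachable from the known facts
            return False, trace
        if not steps:       # empty plan: goal already satisfied
            return True, trace
        name = steps[0]
        raw_output = execute(name)            # Executor: act on the target
        state |= perceive(name, raw_output)   # Perceptor: output -> facts
        trace.append(name)
    return goal <= state, trace


# Hypothetical three-step attack chain used only for illustration.
ACTIONS = [
    Action("scan_ports", frozenset({"target_known"}), frozenset({"service_found"})),
    Action("exploit_service", frozenset({"service_found"}), frozenset({"shell_obtained"})),
    Action("read_flag", frozenset({"shell_obtained"}), frozenset({"flag_captured"})),
]
EFFECTS = {a.name: a.effects for a in ACTIONS}


def stub_execute(name):
    """Stand-in for an agent running a tool; always reports success."""
    return "ok"


def stub_perceive(name, raw_output):
    """Stand-in for parsing tool output into newly established facts."""
    return set(EFFECTS[name]) if raw_output == "ok" else set()


success, trace = run_pep_loop(
    {"target_known"}, frozenset({"flag_captured"}), ACTIONS,
    stub_execute, stub_perceive,
)
print(success, trace)
```

Because the loop replans from the perceived state after every step, a failed action does not derail the run: the planner simply proposes the next step from whatever facts are actually established. Keeping the state as explicit facts outside the LLM is one plausible reading of how an external planner helps with long-horizon coherence.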

Problem

Research questions and friction points this paper is trying to address.

Automating penetration testing remains a significant research challenge in cybersecurity.

LLM agents struggle with long-horizon planning and complex reasoning in penetration testing.

Whether integrating classical planning with LLM agents can improve penetration capability and reduce costs.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates classical planning with LLM agents

Provides an external, structured "brain" for LLM agents

Improves benchmark success rates by over 20% and cuts time and monetary costs by more than 50%

🔎 Similar Papers

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities