🤖 AI Summary
This work addresses the instability of existing large language model (LLM) agents in real-world penetration testing, which stems primarily from two sources: capability gaps (Type A) and deficiencies in planning and state management (Type B), the latter exacerbated by the absence of real-time task difficulty assessment. To overcome these limitations, the authors propose Excalibur, an agent that eliminates Type A failures through typed tool interfaces and retrieval-augmented skills, and introduces a novel Task Difficulty Assessment (TDA) mechanism. TDA integrates four dimensions (horizon estimation, evidence confidence, context load, and historical success rate) to inform exploration-exploitation trade-offs; coupled with Evidence-Guided Attack Tree Search (EGATS), this enables difficulty-aware planning. Excalibur achieves up to a 91% task completion rate on CTF benchmarks (a 39-49% relative improvement over baselines) and compromises 4 of 5 hosts in the GOAD Active Directory environment, substantially surpassing prior systems.
📝 Abstract
LLM-based agents show promise for automating penetration testing, yet reported performance varies widely across systems and benchmarks. We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. Our analysis reveals two distinct failure modes: Type A failures stem from capability gaps (missing tools, inadequate prompts) that engineering readily addresses, while Type B failures persist regardless of tooling due to planning and state management limitations. We show that Type B failures share a root cause that is largely invariant to the underlying LLM: agents lack real-time task difficulty estimation. As a result, agents misallocate effort, over-commit to low-value branches, and exhaust context before completing attack chains.
Based on this insight, we present Excalibur, a penetration testing agent that couples strong tooling with difficulty-aware planning. A Tool and Skill Layer eliminates Type A failures through typed interfaces and retrieval-augmented knowledge. A Task Difficulty Assessment (TDA) mechanism addresses Type B failures by estimating tractability along four measurable dimensions (horizon estimation, evidence confidence, context load, and historical success) and using these estimates to guide exploration-exploitation decisions within an Evidence-Guided Attack Tree Search (EGATS) framework. Excalibur achieves up to 91% task completion on CTF benchmarks with frontier models (a 39-49% relative improvement over baselines) and compromises 4 of 5 hosts on the GOAD Active Directory environment versus 2 by prior systems. These results show that difficulty-aware planning yields consistent end-to-end gains across models and addresses a limitation that model scaling alone does not eliminate.
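To make the TDA idea concrete, the following is a minimal sketch of how the four dimensions could be folded into a single tractability score that gates an explore/exploit decision. The paper does not specify the actual formula, weights, scales, or threshold, so every name and number below is an illustrative assumption, not Excalibur's implementation.

```python
from dataclasses import dataclass

@dataclass
class TaskSignals:
    """Hypothetical per-branch signals, each normalized to [0, 1]."""
    horizon_estimate: float     # expected remaining steps (higher = longer chain)
    evidence_confidence: float  # confidence in collected evidence
    context_load: float         # fraction of the context window already consumed
    historical_success: float   # past success rate on similar subtasks

def tractability(s: TaskSignals,
                 w: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine the four dimensions into one [0, 1] score.

    Long horizons and heavy context load make a branch *less* tractable,
    so those two dimensions enter inverted. Equal weights are an assumption.
    """
    return (w[0] * (1.0 - s.horizon_estimate)
            + w[1] * s.evidence_confidence
            + w[2] * (1.0 - s.context_load)
            + w[3] * s.historical_success)

def should_explore(score: float, threshold: float = 0.5) -> bool:
    """Exploit the current branch when it looks tractable; otherwise explore
    alternatives in the attack tree (threshold is illustrative)."""
    return score < threshold
```

A tree search in the spirit of EGATS could call `tractability` on each candidate branch and prune or deprioritize branches whose score falls below the threshold, which is how a difficulty estimate would prevent over-committing to low-value paths.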