Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing

📅 2025-09-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Traditional penetration testing is costly, time-consuming, and heavily reliant on human expertise. Existing AI-based agents are evaluated mostly in simplified capture-the-flag (CTF) settings, so their reported performance transfers poorly to real-world networks. Method: The paper shifts the objective of automated penetration testing from “flag capture” to full system compromise. It introduces TermiBench, an open-source, real-world benchmark spanning 510 hosts, 25 services, and 30 CVEs, and TermiAgent, a multi-agent framework that uses a Located Memory Activation mechanism to mitigate long-context forgetting and structured code understanding to build a reusable exploit toolkit. Contribution/Results: In the reported experiments, TermiAgent outperforms state-of-the-art agents in shell acquisition rate while reducing execution time and financial cost, and it remains practical on laptop-scale hardware. The authors present TermiBench as the first open-source benchmark for real-world autonomous penetration testing.

📝 Abstract

Penetration testing is critical for identifying and mitigating security vulnerabilities, yet traditional approaches remain expensive, time-consuming, and dependent on expert human labor. Recent work has explored AI-driven pentesting agents, but their evaluation relies on oversimplified capture-the-flag (CTF) settings that embed prior knowledge and reduce complexity, leading to performance estimates far from real-world practice. We close this gap by introducing the first real-world, agent-oriented pentesting benchmark, TermiBench, which shifts the goal from 'flag finding' to achieving full system control. The benchmark spans 510 hosts across 25 services and 30 CVEs, with realistic environments that require autonomous reconnaissance, discrimination between benign and exploitable services, and robust exploit execution. Using this benchmark, we find that existing systems can hardly obtain system shells under realistic conditions. To address these challenges, we propose TermiAgent, a multi-agent penetration testing framework. TermiAgent mitigates long-context forgetting with a Located Memory Activation mechanism and builds a reliable exploit arsenal via structured code understanding rather than naive retrieval. In evaluations, our work outperforms state-of-the-art agents, exhibiting stronger penetration testing capability, reducing execution time and financial cost, and demonstrating practicality even on laptop-scale deployments. Our work delivers both the first open-source benchmark for real-world autonomous pentesting and a novel agent framework that establishes a milestone for AI-driven penetration testing.
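The success criterion described above (obtaining a system shell rather than finding a flag) lends itself to a simple per-host metric. The following is a minimal sketch of how a shell acquisition rate and a mean time-to-shell might be computed from per-host outcomes. The `HostResult` record, its field names, and the sample data are hypothetical illustrations, not the paper's evaluation code or results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class HostResult:
    """Hypothetical per-host outcome record (not from the paper)."""
    host: str             # illustrative host identifier
    shell_obtained: bool  # True only if the agent obtained a system shell
    elapsed_s: float      # wall-clock seconds spent on this host


def shell_acquisition_rate(results: List[HostResult]) -> float:
    """Fraction of hosts on which a system shell was obtained."""
    if not results:
        return 0.0
    return sum(r.shell_obtained for r in results) / len(results)


def mean_time_to_shell(results: List[HostResult]) -> Optional[float]:
    """Mean elapsed time over successful hosts, or None if none succeeded."""
    times = [r.elapsed_s for r in results if r.shell_obtained]
    return mean(times) if times else None


# Made-up sample data for illustration only.
results = [
    HostResult("host-a", True, 120.0),
    HostResult("host-b", False, 300.0),
    HostResult("host-c", True, 80.0),
    HostResult("host-d", False, 300.0),
]

print(shell_acquisition_rate(results))  # 0.5
print(mean_time_to_shell(results))      # 100.0
```

Counting only hosts where a shell was actually obtained, rather than partial progress such as a discovered service or a recovered flag, mirrors the stricter goal of full system control that the benchmark adopts.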

Problem

Research questions and friction points this paper is trying to address.

No real-world penetration testing benchmark exists for AI agents

Existing systems can hardly obtain system shells under realistic conditions

Oversimplified CTF evaluations embed prior knowledge and overstate real-world performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework for penetration testing

Located Memory Activation to mitigate long-context forgetting

Structured code understanding, rather than naive retrieval, to build a reliable exploit arsenal
