🤖 AI Summary
This study addresses the current lack of systematic architectural analysis and large-scale empirical comparison of large language model–driven automated penetration testing (AutoPT) frameworks under a unified benchmark. Adopting a Systematization of Knowledge (SoK) approach, it proposes the first structured taxonomy encompassing six key dimensions: agent architecture, planning, memory, execution, external knowledge integration, and evaluation benchmarks. The work conducts extensive experiments on 15 prominent AutoPT frameworks—including 13 open-source systems and 2 baselines—within a standardized penetration testing environment. Consuming over 10 billion tokens and producing more than 1,500 expert-reviewed execution logs, the study establishes the largest empirical evaluation benchmark to date, offering the community a reliable reference and clear guidance for future research directions.
📝 Abstract
The rapid advancement of Large Language Models (LLMs) has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural design and comprehensive empirical evaluation of current LLM-based AutoPT frameworks. At the systematization level, we comprehensively review existing framework designs across six dimensions: agent architecture, agent planning, agent memory, agent execution, external knowledge, and benchmarks. At the empirical level, we conduct large-scale experiments on 13 representative open-source AutoPT frameworks and 2 baseline frameworks using a unified benchmark. The experiments consumed over 10 billion tokens in total and generated more than 1,500 execution logs, which were manually reviewed and analyzed over four months by a panel of more than 15 researchers with expertise in cybersecurity. By investigating the latest progress in this rapidly developing field, we provide researchers with a structured taxonomy for understanding existing LLM-based AutoPT frameworks and a large-scale empirical benchmark, along with promising directions for future research.