From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

Date: 2026-05-11
AI Summary
Current evaluations of AI-powered penetration testing agents are often confined to simplified scenarios that fail to capture the complexity, openness, and strategic depth of real-world offensive operations. This work proposes a red-team-oriented evaluation protocol that shifts the focus from task completion to verified vulnerability discovery across diverse attack surfaces and vulnerability types. The approach combines structured ground truth with large language model-based semantic matching to identify vulnerabilities, and adds bipartite resolution for scoring, continuous ground-truth maintenance, and reduced test-suite selection. It further evaluates stochastic agents through repeated runs with cumulative assessment and incorporates efficiency metrics. The study delivers a reproducible, realistic evaluation framework, accompanied by expert-annotated ground truth and open-source code, intended to make agent evaluations more operationally relevant.
๐Ÿ“ Abstract
AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.
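To make the scoring idea concrete, here is a minimal sketch of how bipartite resolution and cumulative evaluation could fit together. It is not the authors' implementation: the tuple format for findings, the `is_match` rule (a simple stand-in for the LLM-based semantic matcher), and all function names are assumptions made for illustration. Each finding may plausibly match several ground-truth entries, so a maximum bipartite matching (Kuhn's augmenting-path algorithm here) assigns each finding to at most one entry and each entry to at most one finding. Matched entries are then unioned across repeated runs.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical record format: (vulnerability class, endpoint, parameter or None).
Finding = Tuple[str, str, Optional[str]]

def is_match(finding: Finding, truth: Finding) -> bool:
    """Toy stand-in for an LLM semantic matcher (assumption, not the paper's)."""
    f_cls, f_endpoint, f_param = finding
    t_cls, t_endpoint, t_param = truth
    # A finding without a parameter is ambiguous: it matches any parameter.
    return f_cls == t_cls and f_endpoint == t_endpoint and f_param in (None, t_param)

def max_bipartite_matching(candidates: Dict[int, List[int]]) -> Dict[int, int]:
    """Kuhn's algorithm. Maps finding index -> ground-truth index."""
    owner: Dict[int, int] = {}  # ground-truth index -> finding index

    def try_assign(f: int, seen: Set[int]) -> bool:
        for t in candidates.get(f, []):
            if t in seen:
                continue
            seen.add(t)
            # Take a free entry, or re-route the finding that currently holds it.
            if t not in owner or try_assign(owner[t], seen):
                owner[t] = f
                return True
        return False

    for f in candidates:
        try_assign(f, set())
    return {f: t for t, f in owner.items()}

def score_run(findings: List[Finding], ground_truth: List[Finding],
              matcher: Callable[[Finding, Finding], bool]) -> Set[int]:
    """Return the ground-truth indices credited to one agent run."""
    candidates = {
        i: [j for j, gt in enumerate(ground_truth) if matcher(f, gt)]
        for i, f in enumerate(findings)
    }
    return set(max_bipartite_matching(candidates).values())

def cumulative_recall(runs: List[List[Finding]], ground_truth: List[Finding],
                      matcher: Callable[[Finding, Finding], bool]) -> List[float]:
    """Recall after each run, counting entries found in any run so far."""
    found: Set[int] = set()
    curve: List[float] = []
    for findings in runs:
        found |= score_run(findings, ground_truth, matcher)
        curve.append(len(found) / len(ground_truth))
    return curve

# Illustrative data only; none of it comes from the paper.
GROUND_TRUTH: List[Finding] = [
    ("xss", "/search", "q"),
    ("xss", "/search", "lang"),
    ("sqli", "/login", "user"),
]
RUN_1: List[Finding] = [("xss", "/search", None), ("xss", "/search", "q")]
RUN_2: List[Finding] = [("sqli", "/login", "user"), ("xss", "/search", "q")]

run_1_hits = score_run(RUN_1, GROUND_TRUTH, is_match)
recall_curve = cumulative_recall([RUN_1, RUN_2], GROUND_TRUTH, is_match)
```

In this toy data, the first finding of `RUN_1` is ambiguous between the two `/search` entries. A greedy first-match rule would spend it on the `q` entry and leave the more specific second finding uncredited, whereas the matching routes the ambiguous finding to the `lang` entry so both are counted.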
Problem

Research questions and friction points this paper is trying to address.

AI pentesting
real-world evaluation
vulnerability discovery
benchmarking
offensive security
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI pentesting
vulnerability discovery
evaluation protocol
LLM-based semantic matching
bipartite resolution
Authors

Pedro Conde
Ethiack, Coimbra, Portugal
Henrique Branquinho
Ethiack, Coimbra, Portugal
Valerio Mazzone
Ethiack, Coimbra, Portugal
Bruno Mendes
Ethiack, Porto, Portugal
André Baptista
Ethiack, Porto, Portugal
Nuno Moniz
Associate Research Professor at Lucy Family Institute for Data & Society, University of Notre Dame
Imbalanced Learning, Responsible AI, Data Privacy