PenForge: On-the-Fly Expert Agent Construction for Automated Penetration Testing

📅 2026-01-11

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of existing automated penetration testing methods, which struggle in complex scenarios and with diverse, previously unseen vulnerabilities. The authors propose PenForge, a framework powered by large language models that constructs context-aware, specialized expert agents on the fly instead of relying on agents prepared beforehand. It combines automated reconnaissance of potential attack surfaces with agents instantiated during testing for context-aware exploitation. Evaluated on CVE-Bench in the zero-day setting, the method achieves a 30.0% exploit success rate (12 out of 40), a threefold improvement over the state of the art.

📝 Abstract

Penetration testing is essential for identifying vulnerabilities in web applications before real adversaries can exploit them. Recent work has explored automating this process with Large Language Model (LLM)-powered agents, but existing approaches rely either on a single generic agent that struggles in complex scenarios or on narrowly specialized agents that cannot adapt to diverse vulnerability types. We therefore introduce PenForge, a framework that dynamically constructs expert agents during testing rather than relying on those prepared beforehand. By integrating automated reconnaissance of potential attack surfaces with agents instantiated on the fly for context-aware exploitation, PenForge achieves a 30.0% exploit success rate (12/40) on CVE-Bench in the particularly challenging zero-day setting, a 3× improvement over the state of the art. Our analysis also identifies three opportunities for future work: (1) supplying richer tool-usage knowledge to improve exploitation effectiveness; (2) extending benchmarks to include more vulnerabilities and attack types; and (3) fostering developer trust by incorporating explainable mechanisms and human review. As an emerging result with substantial potential impact, PenForge embodies the early-stage yet paradigm-shifting idea of on-the-fly agent construction, marking its promise as a step toward scalable and effective LLM-driven penetration testing.
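
The abstract describes the pipeline only at a high level: reconnaissance first, then expert agents built during testing from what reconnaissance found. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not PenForge's implementation: the class names, the `TOOL_HINTS` table, the fallback tool, and the prompt wording are all hypothetical and do not come from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReconFinding:
    """One potential attack surface reported by reconnaissance (hypothetical shape)."""
    endpoint: str
    suspected_class: str  # e.g. "sqli", "file_upload"


@dataclass
class ExpertAgent:
    """Specification for an agent created at test time (hypothetical shape)."""
    name: str
    system_prompt: str
    tools: list[str]


# Assumed mapping from vulnerability class to tool names; purely illustrative.
TOOL_HINTS: dict[str, list[str]] = {
    "sqli": ["sqlmap"],
    "file_upload": ["curl"],
}

# Assumed fallback when no class-specific tools are known.
DEFAULT_TOOLS = ["curl"]


def build_expert(finding: ReconFinding) -> ExpertAgent:
    """Build an agent spec from a single finding instead of picking a predefined agent."""
    tools = TOOL_HINTS.get(finding.suspected_class, DEFAULT_TOOLS)
    prompt = (
        f"You are a penetration-testing expert in {finding.suspected_class}. "
        f"Target endpoint: {finding.endpoint}. "
        "Operate only within the authorized test environment."
    )
    return ExpertAgent(
        name=f"{finding.suspected_class}-expert",
        system_prompt=prompt,
        tools=list(tools),  # copy so agents never share a mutable list
    )


def forge_agents(findings: list[ReconFinding]) -> list[ExpertAgent]:
    """Create one context-aware agent spec per reconnaissance finding."""
    return [build_expert(f) for f in findings]
```

In this sketch the agent's prompt and tool list are derived from each finding at run time, which is the contrast the abstract draws with agents prepared beforehand. How PenForge actually represents findings, selects tools, or prompts the LLM is not specified in the abstract.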

Problem

Research questions and friction points this paper is trying to address.

automated penetration testing

Large Language Model

expert agents

vulnerability exploitation

adaptive agent construction

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-the-fly agent construction

LLM-powered penetration testing

dynamic expert agents