🤖 AI Summary
The widespread adoption of AI-generated code has created a scalability crisis in security auditing: studies suggest that up to 40% of such code contains vulnerabilities, while manual auditing cannot keep pace with development. To address this, the authors propose MAPTA, a multi-agent penetration testing system for web applications that combines collaborative large language model (LLM) reasoning with tool-augmented execution in a closed-loop workflow spanning vulnerability discovery and exploit validation. Cost analysis supports early-stopping thresholds that limit spending on attempts unlikely to succeed. On the 104-challenge XBOW benchmark, the system achieves an overall success rate of 76.9%, with perfect results (100%) on SSRF and misconfiguration vulnerabilities. It also uncovers critical flaws, including remote code execution (RCE) and command injection, in popular GitHub repositories (8K to 70K stars). The average cost per open-source assessment is $3.67, which indicates that the approach is practical and cost-efficient.
📝 Abstract
AI-powered development platforms are making software creation accessible to a broader audience, but this democratization has triggered a scalability crisis in security auditing. With studies showing that up to 40% of AI-generated code contains vulnerabilities, the pace of development now vastly outstrips the capacity for thorough security assessment.
We present MAPTA, a multi-agent system for autonomous web application security assessment that combines large language model orchestration with tool-grounded execution and end-to-end exploit validation. On the 104-challenge XBOW benchmark, MAPTA achieves 76.9% overall success, with perfect performance on SSRF and misconfiguration vulnerabilities, 83% success on broken authorization, and strong results on injection attacks, including server-side template injection (85%) and SQL injection (83%). Cross-site scripting (57%) and blind SQL injection (0%) remain challenging. Running all 104 challenges cost $21.38 in total, with a median cost of $0.073 for successful attempts versus $0.357 for failed ones. Because successful attempts tend to consume fewer resources, practical early-stopping thresholds can be set at approximately 40 tool calls or $0.30 per challenge.
MAPTA's real-world findings are significant given both the popularity of the scanned GitHub repositories (8K to 70K stars) and MAPTA's low average operating cost of $3.67 per open-source assessment. MAPTA discovered critical vulnerabilities including RCEs, command injections, secret exposure, and arbitrary file writes. All findings have been responsibly disclosed, and 10 are under CVE review.
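The early-stopping thresholds reported above (roughly 40 tool calls or $0.30 per challenge) could be enforced with a small budget guard like the one below. This is a minimal sketch, not MAPTA's implementation: the `AttemptBudget` class, its method names, and the choice to stop when either limit is reached are assumptions made for illustration.

```python
from dataclasses import dataclass

# Approximate thresholds quoted in the abstract.
MAX_TOOL_CALLS = 40
MAX_COST_USD = 0.30


@dataclass
class AttemptBudget:
    """Hypothetical per-challenge budget tracker (not part of MAPTA)."""

    max_tool_calls: int = MAX_TOOL_CALLS
    max_cost_usd: float = MAX_COST_USD
    tool_calls: int = 0
    cost_usd: float = 0.0

    def record(self, call_cost_usd: float) -> None:
        """Register one tool call and the cost attributed to it."""
        self.tool_calls += 1
        self.cost_usd += call_cost_usd

    def should_stop(self) -> bool:
        """Stop once either limit is reached (an assumed policy)."""
        return (
            self.tool_calls >= self.max_tool_calls
            or self.cost_usd >= self.max_cost_usd
        )


# Usage: check the guard after every tool call in the agent loop.
budget = AttemptBudget()
for call_cost in [0.11, 0.11, 0.11]:  # made-up per-call costs
    budget.record(call_cost)
    if budget.should_stop():
        break
```

Since the reported median cost of a failed attempt ($0.357) exceeds the $0.30 limit while the median successful attempt ($0.073) sits well below it, a guard of this kind would mostly cut off attempts that were already unlikely to succeed.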