Cochise: A Reference Harness for Autonomous Penetration Testing

📅 2026-05-12

🤖 AI Summary

This work addresses the difficulty of evaluating large language model (LLM)-driven autonomous penetration-testing systems, whose many combined design choices obscure whether performance gains come from the underlying model or from the engineering around it. To support reproducible research, the authors present cochise, a small (597 lines of code) Python reference harness with a separated Planner-Executor architecture. The harness connects an LLM-driven agent to a Linux execution host over SSH, reaches controlled target environments from that host, keeps long-term state outside the LLM context, and lets the scenario prompt be adapted to different targets. Because the scaffold is deliberately minimal, it can serve as a common baseline for comparing models, agent architectures, and prompting or tool-integration choices. The authors also release a replay tool for offline visualization of captured runs, analysis tools that report cost, token usage, duration, and compromises, and a corpus of JSON trajectory logs. The harness is evaluated on the Game of Active Directory (GOAD) testbed, and the released logs let researchers study agent behavior without provisioning that testbed themselves.
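As an illustration of the kind of cost, token, and duration analysis mentioned above, the following is a minimal sketch. The log schema here is an assumption, not cochise's actual format: it supposes one JSON object per line with the hypothetical fields `tokens`, `cost_usd`, and `duration_s`.

```python
import json
from typing import Dict, Iterable


def summarize_run(lines: Iterable[str]) -> Dict[str, float]:
    """Aggregate per-event metrics from a trajectory log.

    Assumes (hypothetically) one JSON object per line with optional
    'tokens', 'cost_usd', and 'duration_s' fields; missing fields count as 0.
    """
    totals = {"events": 0, "tokens": 0, "cost_usd": 0.0, "duration_s": 0.0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        totals["events"] += 1
        totals["tokens"] += event.get("tokens", 0)
        totals["cost_usd"] += event.get("cost_usd", 0.0)
        totals["duration_s"] += event.get("duration_s", 0.0)
    return totals


# Hypothetical two-event log, inlined so the sketch needs no files.
sample_log = [
    '{"tokens": 100, "cost_usd": 0.5, "duration_s": 2.0}',
    "",
    '{"tokens": 50, "cost_usd": 0.25, "duration_s": 1.5}',
]
summary = summarize_run(sample_log)
```

Because the function accepts any iterable of strings, an open file handle can be passed in place of the inlined list.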

📝 Abstract

Recent work on LLM-driven autonomous penetration testing reports promising results, but existing systems often combine many architectural, prompting, and tool-integration choices, making it difficult to tell what is gained over a simple agent scaffold. We present cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments. Cochise connects an LLM-driven agent to a Linux execution host over SSH and supports controlled target environments reachable from that jump host. The prototype implements a separated Planner-Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on command outputs. The scenario prompt can be adapted to different target environments. To demonstrate the efficacy of our minimal harness, we evaluate it against a live third-party testbed called Game of Active Directory (GOAD). Alongside the harness, we release replay and analysis tools: (i) cochise-replay for offline visualization of captured runs, (ii) cochise-analyze-alogs and cochise-analyze-graphs for cost, token, duration, and compromise analysis, and (iii) a corpus of JSON trajectory logs from GOAD runs, allowing researchers to study agent behavior without provisioning the 48-64 GB RAM / 190 GB storage testbed themselves. Cochise is intended not as a state-of-the-art pen-testing agent, but as reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces.
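The separated Planner-Executor loop described in the abstract can be sketched roughly as follows. This is not cochise's code: `PlanState`, `run_executor`, `run_planner`, `scripted_llm`, and `fake_ssh` are hypothetical names, the model is replaced by a scripted stand-in, and the SSH transport is replaced by a plain callable so the sketch runs without a network or a target.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (command, output) pairs seen so far


@dataclass
class PlanState:
    """Long-term state kept outside the LLM context (hypothetical layout)."""
    open_tasks: List[str]
    findings: List[str] = field(default_factory=list)


def run_executor(
    task: str,
    llm: Callable[[str, History], Optional[str]],
    run_command: Callable[[str], str],
    max_steps: int = 5,
) -> History:
    """ReAct-style loop: ask for a command, run it, feed the output back."""
    history: History = []
    for _ in range(max_steps):
        command = llm(task, history)  # the model sees earlier outputs
        if command is None:  # the model signals the task is finished
            break
        history.append((command, run_command(command)))
    return history


def run_planner(state: PlanState, llm, run_command) -> PlanState:
    """Work through open tasks, recording results in the external state."""
    while state.open_tasks:
        task = state.open_tasks.pop(0)
        for command, output in run_executor(task, llm, run_command):
            state.findings.append(f"{task}: {command} -> {output}")
    return state


def scripted_llm(task: str, history: History) -> Optional[str]:
    # Stand-in for a model call: one command per task, then stop.
    return None if history else f"echo {task}"


def fake_ssh(command: str) -> str:
    # Stand-in for executing a command on the jump host over SSH.
    return f"ran[{command}]"


final_state = run_planner(PlanState(open_tasks=["scan", "enum"]), scripted_llm, fake_ssh)
```

Keeping `PlanState` in ordinary Python data means only the current task and its short command history need to be passed to the model at each step. In a real setup, `run_command` would wrap an SSH client and `llm` would call a model API; both are left abstract here.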

Problem

Research questions and friction points this paper is trying to address.

autonomous penetration testing

LLM-driven agent

experimental infrastructure

benchmarking

agent architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

autonomous penetration testing

LLM agent architecture

Planner-Executor separation