ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current security evaluations of large language models often reduce vulnerability exploitation to a binary outcome (did a crash occur or not), overlooking the multi-stage progression from initial trigger to arbitrary code execution. This work proposes a fine-grained evaluation framework built on a capability ladder that decomposes exploitation into 16 measurable capability flags. Each flag is checked by a deterministic oracle that combines per-run randomized challenge-response for primitives, differential execution against ground-truth binaries, and a signal-handler proof for code execution. The study assesses model exploitation capability on 41 V8 bugs under three conditions: model and environment alone, with adaptive coaching, and with the model's native CLI harness. Across the 8 publicly deployed models tested, reaching the vulnerable code and triggering a crash is routine but arbitrary code execution is not, whereas the private frontier model achieves arbitrary code execution on approximately half, suggesting that exploit construction against hardened targets is an emerging frontier capability.
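To illustrate why a graded ladder carries more information than a crash-or-not score, here is a minimal Python sketch. The `Rung` enum, `highest_rung`, and `binary_score` are hypothetical names introduced for illustration; the enum lists only the stages named in the abstract, not the benchmark's actual 16 flags, and the ordering is an assumption.

```python
from enum import IntEnum


class Rung(IntEnum):
    """Hypothetical ladder: only the stages named in the abstract,
    not the benchmark's actual 16 flags."""
    COVERAGE = 1
    CRASH = 2
    SANDBOX_PRIMITIVE = 3
    ARBITRARY_READ_WRITE = 4
    CONTROL_FLOW_HIJACK = 5
    ARBITRARY_CODE_EXECUTION = 6


def highest_rung(flags):
    """Return the highest rung captured in a run, or None if no flag was captured."""
    return max(flags, default=None)


def binary_score(flags):
    """Crash-as-success scoring, as used by binary benchmarks."""
    return Rung.CRASH in flags


# Two runs that a binary benchmark cannot tell apart.
run_a = {Rung.COVERAGE, Rung.CRASH}
run_b = set(Rung)  # every rung captured, up to arbitrary code execution

print(binary_score(run_a), binary_score(run_b))
print(highest_rung(run_a).name, highest_rung(run_b).name)
```

Both runs count as a success under binary scoring, while the graded view separates a run that stops at a crash from one that reaches code execution.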

📝 Abstract

Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. However, existing LLM security benchmarks treat a crash as exploitation success. That single binary outcome collapses the hard parts of exploitation: the transition from triggering a bug to constructing reusable primitives and control. We present ExploitBench, a capability-graded benchmark that decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. We instantiate ExploitBench on 41 V8 bugs because V8 is both widely deployed and exploitation-hardened. We report three arms: <model, env> as the primary measurement of model-environment capability, <model, env, adaptive coaching> as a secondary arm that adds adaptive coaching to test whether targeted feedback shifts outcomes, and <model, env, harness> as an ablation that swaps in the model's native CLI to check whether vendor-side optimizations increase exploitation capabilities. Our results show a sharp capability split between publicly deployed frontier models and the private frontier. Across the 8 publicly deployed models tested, reaching the vulnerable code and triggering a crash is routine, but arbitrary code execution is not. The private model shows arbitrary code execution on approximately half. Overall, results suggest that exploit construction against hardened targets is an emerging frontier capability.
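The per-run randomized challenge-response idea can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the paper's oracle: `ReadOracle`, its simulated `memory` buffer, and the method names are all hypothetical. The point it shows is that a fresh random nonce at a random address cannot be hardcoded or guessed, so returning it is evidence of a working read primitive in that specific run.

```python
import hmac
import secrets


class ReadOracle:
    """Toy challenge-response check for an arbitrary-read primitive.

    Hypothetical sketch: a bytearray stands in for the target's memory.
    """

    def __init__(self, memory_size=4096, nonce_len=16):
        self.memory = bytearray(memory_size)
        # Fresh secret per run, so answers cannot be replayed across runs.
        self.nonce = secrets.token_bytes(nonce_len)
        # Random placement, so the location cannot be hardcoded either.
        self.address = secrets.randbelow(memory_size - nonce_len)
        self.memory[self.address:self.address + nonce_len] = self.nonce

    def challenge(self):
        """Tell the agent where to read and how many bytes."""
        return self.address, len(self.nonce)

    def verify(self, response):
        """Deterministic pass/fail: the response must equal this run's nonce."""
        return hmac.compare_digest(bytes(response), self.nonce)


oracle = ReadOracle()
addr, length = oracle.challenge()

# Stand-in for the exploit's read primitive: read directly from the buffer.
leaked = bytes(oracle.memory[addr:addr + length])
tampered = bytes([leaked[0] ^ 1]) + leaked[1:]

print(oracle.verify(leaked))    # a correct read passes
print(oracle.verify(tampered))  # any wrong byte fails
```

In a real setting the read would have to go through the exploit itself rather than direct buffer access; the direct slice here only stands in for that step.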

Problem

Research questions and friction points this paper is trying to address.

exploitation

capability ladder

LLM security benchmark

hardened targets

arbitrary code execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

capability-graded benchmark

exploitation ladder

deterministic oracle