Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

📅 2024-08-15
📈 Citations: 22
Influential: 4
🤖 AI Summary
Existing evaluations of large language models (LLMs) in cybersecurity lack systematic, quantitative assessment, particularly for autonomous offensive tasks such as penetration testing. Method: The authors propose Cybench, an open-source benchmark framework for quantifying the cybersecurity capabilities of LM agents. It comprises 40 professional-level CTF challenges drawn from 4 competitions, a sandboxed command-execution environment, and four agent scaffolds (structured bash, action-only, pseudoterminal, and web search), evaluated across 8 models. Crucially, a subtask decomposition mechanism breaks each task into intermediary steps, enabling progressive, granular diagnosis of agent capabilities and fully automated, observable end-to-end evaluation. Results: Without subtask guidance, agents built on GPT-4o, Claude 3.5 Sonnet, OpenAI o1-preview, and Claude 3 Opus solved complete tasks that took human teams up to 11 minutes; the hardest task took human teams 24 hours and 54 minutes. All code and data are publicly released.
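In spirit, the sandboxed observe-act loop the summary describes can be sketched as follows. This is a minimal illustration with hypothetical names (`run_agent_step`, `propose_command`), not the framework's actual API:

```python
import subprocess

def run_agent_step(command: str, timeout: int = 60) -> str:
    """Execute an agent-proposed shell command in a sandbox and
    return its combined output (hypothetical helper)."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def solve_task(model, max_iterations: int = 15) -> bool:
    """A minimal loop: the model proposes a command, the environment
    executes it, and the output is fed back as the next observation."""
    observation = "Task description and starter files listed here."
    for _ in range(max_iterations):
        command = model.propose_command(observation)  # LM call (stub)
        observation = run_agent_step(command)
        if "flag{" in observation:  # CTF flag recovered
            return True
    return False
```

The real framework adds per-scaffold prompting, iteration limits, and answer checking; this sketch only conveys the command-execute-observe cycle.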

📝 Abstract
Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks for each task, which break down a task into intermediary steps for a more detailed evaluation. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. For the top performing models (GPT-4o and Claude 3.5 Sonnet), we further investigate performance across 4 agent scaffolds (structured bash, action-only, pseudoterminal, and web search). Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that took human teams up to 11 minutes to solve. In comparison, the most difficult task took human teams 24 hours and 54 minutes to solve. All code and data are publicly available at https://cybench.github.io.
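The subtask decomposition described above lends itself to a simple granular metric: the fraction of intermediary steps an agent answers correctly. A rough sketch, with illustrative field names that are not Cybench's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    """One intermediary step of a CTF task (illustrative schema)."""
    question: str
    answer: str

@dataclass
class Task:
    """A CTF task with its description, starter files, and subtasks."""
    name: str
    description: str
    starter_files: list = field(default_factory=list)
    subtasks: list = field(default_factory=list)

def subtask_score(task: Task, agent_answers: list) -> float:
    """Fraction of subtasks answered correctly, in the spirit of
    subtask-level evaluation (hypothetical helper)."""
    if not task.subtasks:
        return 0.0
    correct = sum(
        given == sub.answer
        for given, sub in zip(agent_answers, task.subtasks)
    )
    return correct / len(task.subtasks)
```

A partial score surfaces how far into a hard task an agent progressed, even when the final flag is out of reach.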
Problem

Research questions and friction points this paper is trying to address.

Evaluating cybersecurity capabilities of Language Model agents
Quantifying cyberrisk from agents capable of autonomous vulnerability identification and exploitation
Assessing LM agents in professional-level Capture the Flag tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for evaluating LM cybersecurity capabilities
Includes 40 professional Capture the Flag tasks
Introduces subtasks for detailed agent evaluation
Andy K. Zhang
Stanford University
Neil Perry
Stanford University
Riya Dulepet
Stanford University
Eliot Jones
Head of Offensive Cybersecurity, Gray Swan AI
Justin W. Lin
Stanford University
Joey Ji
Stanford University
Celeste Menders
Stanford University
Gashon Hussein
Stanford University
Samantha Liu
Stanford University
Donovan Jasper
Stanford University
Pura Peetathawatchai
Stanford University
Ari Glenn
Stanford University
Vikram Sivashankar
Stanford University
Daniel Zamoshchin
Stanford University
Leo Glikbarg
Stanford University
Derek Askaryar
Stanford University
Mike Yang
Stanford University
Teddy Zhang
Stanford University
Rishi Alluri
Stanford University
Nathan Tran
Stanford University
Rinnara Sangpisit
Stanford University
Polycarpos Yiorkadjis
Stanford University
Kenny Osele
Stanford University
Gautham Raghupathi
Stanford University
Dan Boneh
Professor of Computer Science, Stanford University
Cryptography · Computer Security · Computer Science Theory
Daniel E. Ho
Stanford University
Regulatory policy · artificial intelligence · administrative law · antidiscrimination
Percy Liang
Associate Professor of Computer Science, Stanford University
Machine learning · natural language processing