CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

📅 2025-03-21
🤖 AI Summary
Existing AI security benchmarks predominantly rely on abstract CTF-style scenarios, lacking systematic coverage of real-world web application vulnerabilities, and thus fail to rigorously evaluate LLM-based agents' autonomous offensive and defensive capabilities in near-production environments.

Method: We introduce the first benchmark of LLM agents' attack capabilities targeting critical-severity CVE vulnerabilities, built upon a reproducible, strongly isolated dynamic sandbox framework. It integrates Dockerized vulnerable applications (VulnApps), LLM agent orchestration, and multi-dimensional exploitation validation, including payload execution, data exfiltration, and privilege escalation.

Contribution/Results: This work pioneers the systematic incorporation of real-world CVEs into AI security evaluation, overcoming the abstraction and coverage limitations of prior CTF-based benchmarks. Experiments reveal that state-of-the-art LLM agents successfully exploit only 13% of the critical-severity CVEs, exposing critical gaps in practical offensive capability. The benchmark provides a reproducible, extensible foundation for AI security assessment and defense research.
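The summary describes grading agent attempts along several attack dimensions (payload execution, data exfiltration, privilege escalation). A minimal illustrative sketch of that kind of outcome-based grading is below; the class and field names (`ExploitResult`, `grade`, the marker values) are hypothetical, not CVE-Bench's actual API, and a real harness would record these side effects from inside the Dockerized sandbox.

```python
# Hypothetical sketch of CVE-Bench-style exploit grading. All names and
# marker values here are illustrative assumptions, not the benchmark's API.
from dataclasses import dataclass, field


@dataclass
class ExploitResult:
    # Observable side effects the sandbox would record after an agent run.
    files_written: set = field(default_factory=set)
    data_read: set = field(default_factory=set)
    effective_user: str = "www-data"


def grade(result: ExploitResult) -> dict:
    """Check the attempt along the three dimensions named in the summary."""
    return {
        "payload_execution": "/tmp/pwned" in result.files_written,
        "data_exfiltration": "users_table" in result.data_read,
        "privilege_escalation": result.effective_user == "root",
    }


# An attempt that achieved payload execution but nothing else.
attempt = ExploitResult(files_written={"/tmp/pwned"})
print(grade(attempt))
# An attempt counts as a success if any targeted dimension is achieved.
print(any(grade(attempt).values()))
```

Grading by observable side effects rather than by inspecting the agent's transcript is what lets a benchmark evaluate "unpredictable" exploit strategies: any path to the same outcome scores the same.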

📝 Abstract
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% of vulnerabilities.
Problem

Research questions and friction points this paper is trying to address.

Assess LLM agents' ability to exploit real-world web vulnerabilities
Address lack of comprehensive benchmarks for real-world cyber threats
Develop sandbox framework to evaluate unpredictable exploit scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world cybersecurity benchmark for LLM agents
Sandbox framework mimicking real-world exploit scenarios
Evaluation of LLM agents' vulnerability exploitation ability
Yuxuan Zhu
PhD student, University of Illinois Urbana-Champaign
Data systems, AI evaluation
Antony Kellermann
Visiting AI Safety Researcher, UIUC
machine learning
Dylan Bowman
HUD
deep learning, large language models, AI security, AI safety, AI alignment
Philip Li
University of Illinois Urbana-Champaign
Akul Gupta
University of Illinois Urbana-Champaign
Adarsh Danda
Siebel School of Computing and Data Science, University of Illinois, Urbana-Champaign, USA
Richard Fang
University of Illinois
Conner Jensen
Siebel School of Computing and Data Science, University of Illinois, Urbana-Champaign, USA
Eric Ihli
Jason Benn
Independent
ML Systems
Jet Geronimo
Siebel School of Computing and Data Science, University of Illinois, Urbana-Champaign, USA
Avi Dhir
Siebel School of Computing and Data Science, University of Illinois, Urbana-Champaign, USA
Sudhit Rao
Siebel School of Computing and Data Science, University of Illinois, Urbana-Champaign, USA
Kaicheng Yu
Assistant Professor, Westlake University, PI of Autonomous Intelligence Lab
computer vision, 3D understanding, autonomous perception, automatic machine learning
Twm Stone
Daniel Kang
UIUC
Computer Science