ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study evaluates the capability of large language model (LLM) agents to autonomously discover and remediate security vulnerabilities in previously unknown zero-day scenarios. To this end, we introduce ZeroDayBench, the first systematic benchmark specifically designed for zero-day vulnerability assessment, comprising 22 novel critical vulnerabilities embedded within real-world open-source codebases to evaluate proactive defense performance. Experiments conducted with state-of-the-art models—including GPT-5.2, Claude Sonnet 4.5, and Grok 4.1—demonstrate that current LLMs remain unable to fully automate the end-to-end process of vulnerability identification and patching. These findings delineate the current limitations of LLMs in autonomous cyber defense and highlight key directions for improvement. This work establishes a foundational benchmark and provides empirical insights for advancing LLM-driven automated security response systems.

📝 Abstract
Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark where LLM agents find and patch 22 novel critical vulnerabilities in open-source codebases. We focus our efforts on three popular frontier agentic LLMs: GPT-5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks and observe some behavioral patterns that suggest how these models can be improved in the domain of proactive cyberdefense.
Problem

Research questions and friction points this paper is trying to address.

zero-day vulnerabilities · LLM agents · cyberdefense · autonomous patching · security evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ZeroDayBench · LLM agents · zero-day vulnerabilities · autonomous patching · cyberdefense
Authors

Nancy Lau (UC Santa Cruz)
Louis Sloot (Carnegie Mellon University)
Jyoutir Raj (Independent)
Giuseppe Marco Boscardin (Independent)
Evan Harris (Independent)
Dylan Bowman (HUD)
Mario Brajkovski (HUD)
Jaideep Chawla (HUD)
Dan Zhao (New York University (NYU), Massachusetts Institute of Technology (MIT))