ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates whether large language model (LLM) agents can autonomously discover and remediate previously unknown (zero-day) security vulnerabilities. It introduces ZeroDayBench, the first systematic benchmark specifically designed for zero-day vulnerability assessment, comprising 22 novel critical vulnerabilities embedded in real-world open-source codebases. Experiments with three frontier models (GPT-5.2, Claude Sonnet 4.5, and Grok 4.1) show that current LLMs cannot yet fully automate the end-to-end process of identifying and patching these vulnerabilities. The results delineate the current limits of LLMs in autonomous cyberdefense and, through the behavioral patterns observed, suggest directions for improvement.

📝 Abstract

Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark where LLM agents find and patch 22 novel critical vulnerabilities in open-source codebases. We focus our efforts on three popular frontier agentic LLMs: GPT-5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks and observe some behavioral patterns that suggest how these models can be improved in the domain of proactive cyberdefense.

Problem

Research questions and friction points this paper is trying to address.

zero-day vulnerabilities

LLM agents

cyberdefense

autonomous patching

security evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

ZeroDayBench

LLM agents

zero-day vulnerabilities