🤖 AI Summary
This study evaluates whether large language model (LLM) agents can autonomously discover and remediate zero-day security vulnerabilities. To this end, we introduce ZeroDayBench, a benchmark designed for zero-day vulnerability assessment, comprising 22 novel critical vulnerabilities in open-source codebases. Experiments with three frontier models (GPT-5.2, Claude Sonnet 4.5, and Grok 4.1) show that current LLMs cannot yet autonomously complete the end-to-end process of finding and patching these vulnerabilities. The observed behavioral patterns delineate the current limits of LLMs in proactive cyber defense and suggest directions for improving them.
📝 Abstract
Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit of these agents is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark in which LLM agents find and patch 22 novel critical vulnerabilities in open-source codebases. We focus on three popular frontier agentic LLMs: GPT-5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks, and we observe behavioral patterns that suggest how these models can be improved for proactive cyberdefense.