🤖 AI Summary
This work addresses the risk of substantial asset losses from smart contract vulnerabilities by introducing EVMbench, an end-to-end evaluation benchmark for AI agents. Built on 117 curated vulnerabilities drawn from 40 repositories, it measures agent capability across three tasks: detecting, patching, and exploiting vulnerabilities. In its most realistic setting, grading is programmatic, based on tests and blockchain state in a local Ethereum execution environment. The evaluation finds that frontier AI agents can discover and exploit vulnerabilities end-to-end against live blockchain instances. The tasks, code, and tooling are released to support continued measurement of these capabilities and future security research.
📝 Abstract
Smart contracts on public blockchains now manage large amounts of value, and vulnerabilities in these systems can lead to substantial losses. As AI agents become more capable at reading, writing, and running code, it is natural to ask how well they can already navigate this landscape, both in ways that improve security and in ways that might increase risk. We introduce EVMbench, an evaluation that measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. EVMbench draws on 117 curated vulnerabilities from 40 repositories and, in the most realistic setting, uses programmatic grading based on tests and blockchain state under a local Ethereum execution environment. We evaluate a range of frontier agents and find that they are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances. We release code, tasks, and tooling to support continued measurement of these capabilities and future work on security.
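To make "programmatic grading based on blockchain state" concrete, here is a minimal sketch of what a state-based check for an exploit task could look like. It is not EVMbench's grading harness, which the abstract does not describe. The `StateSnapshot` type, the `grade_exploit` function, the addresses, and the success criterion (the victim contract lost at least a threshold amount and the attacker's balance grew) are all hypothetical, chosen only to illustrate the idea of comparing chain state before and after an agent acts.

```python
from dataclasses import dataclass


@dataclass
class StateSnapshot:
    """Hypothetical snapshot of account balances (in wei), keyed by address."""
    balances: dict[str, int]


def grade_exploit(
    before: StateSnapshot,
    after: StateSnapshot,
    attacker: str,
    victim: str,
    min_drained_wei: int,
) -> bool:
    """Return True if the state change looks like a successful exploit.

    Assumed criterion (illustrative only): the victim lost at least
    `min_drained_wei` and the attacker ended with more than it started with.
    """
    drained = before.balances.get(victim, 0) - after.balances.get(victim, 0)
    gained = after.balances.get(attacker, 0) - before.balances.get(attacker, 0)
    return drained >= min_drained_wei and gained > 0


# Hypothetical example: the victim contract is drained of 10 ether.
ETHER = 10**18
before = StateSnapshot({"0xVictim": 10 * ETHER, "0xAttacker": 1 * ETHER})
after = StateSnapshot({"0xVictim": 0, "0xAttacker": 10 * ETHER + 9 * 10**17})

passed = grade_exploit(before, after, "0xAttacker", "0xVictim", min_drained_wei=ETHER)
unchanged = grade_exploit(before, before, "0xAttacker", "0xVictim", min_drained_wei=ETHER)
```

In a real harness the snapshots would presumably be read from the local Ethereum node rather than constructed by hand. The point of the sketch is that the pass/fail decision depends only on observable chain state, not on the agent's own report of success.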