Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the overly optimistic picture of AI agents' smart contract auditing capabilities painted by existing benchmarks such as EVMbench. The authors construct the first temporally contamination-free real-world vulnerability dataset (22 security incidents postdating every model's release) and systematically evaluate 26 agent configurations spanning four model families and three scaffolding frameworks on end-to-end vulnerability detection and exploitation tasks. Results show that no agent achieved full end-to-end exploitation across any of the 110 agent-incident pairs despite detecting up to 65% of vulnerabilities; that an open-source scaffold outperformed vendor alternatives by up to five percentage points; and that performance depends heavily on scaffold design, with exploitation, not detection, emerging as the primary bottleneck. The findings indicate that current AI systems cannot complete auditing tasks independently and are best deployed in collaboration with human experts.

📝 Abstract
EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results -- agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset -- have fueled expectations that fully automated AI auditing is within reach. We identify two limitations: its narrow evaluation scope (14 agent configurations, most models tested only on their vendor's scaffold) and its reliance on audit-contest data published before every model's release, which the models may therefore have seen during training. To address these, we expand to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents postdating every model's release date. Our evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, contradicting EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, yet EVMbench does not control for this. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow where AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.
Problem

Research questions and friction points this paper is trying to address.

smart contract security
AI agents
benchmark evaluation
data contamination
automated auditing
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI agent evaluation
smart contract security
benchmark contamination
scaffold impact
real-world vulnerability dataset