SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems -- A Case Study on Ethereum Clients

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that ambiguous requirements in natural-language specifications often lead to consistent errors across multiple implementations, errors that traditional differential testing fails to detect. To this end, the authors propose SPECA, a framework that automatically translates informal specifications into structured checklists and maps them to critical code locations across diverse implementations, enabling checklist-driven, one-to-many cross-implementation auditing. SPECA integrates natural language processing, threat modeling, and agent-based automated auditing into an end-to-end specification-alignment verification system. Evaluated on the Ethereum Fusaka upgrade, cross-implementation checks accounted for 76.5% of the valid vulnerabilities found, and the optimized auditing agent achieved 27.3% recall on high-severity vulnerabilities, outperforming 96% of human auditors.
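The checklist-driven, one-to-many reuse the summary describes can be sketched as follows. This is a minimal illustration under assumed structure, not SPECA's actual API: `ChecklistItem`, `audit`, and the `check` callback are hypothetical names standing in for the framework's checklist extraction, location mapping, and agent-based verification steps.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    """One normative requirement extracted from the specification (hypothetical type)."""
    requirement: str
    # Mapping from client name to the code location where the requirement
    # is implemented -- the "one-to-many" part: one item, many implementations.
    locations: dict[str, str] = field(default_factory=dict)

def audit(items, clients, check):
    """Check every checklist item once per client; collect non-conforming spots.

    `check(requirement, client, location)` stands in for the auditing agent's
    verification of a single (requirement, code location) pair.
    """
    findings = []
    for item in items:
        for client in clients:
            location = item.locations.get(client)
            if location and not check(item.requirement, client, location):
                findings.append((client, item.requirement, location))
    return findings
```

The point of the sketch is the reuse pattern: a checklist item is derived from the specification once, then audited against every implementation's mapped location, so a single ambiguous requirement can surface as findings in several clients at once.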

📝 Abstract
Multi-implementation systems are increasingly audited against natural-language specifications. Differential testing scales well when implementations disagree, but it provides little signal when all implementations converge on the same incorrect interpretation of an ambiguous requirement. We present SPECA, a Specification-to-Checklist Auditing framework that turns normative requirements into checklists, maps them to implementation locations, and supports cross-implementation reuse. We instantiate SPECA in an in-the-wild security audit contest for the Ethereum Fusaka upgrade, covering 11 production clients. Across 54 submissions, 17 were judged valid by the contest organizers. Cross-implementation checks account for 76.5 percent (13 of 17) of valid findings, suggesting that checklist-derived one-to-many reuse is a practical scaling mechanism in multi-implementation audits. To understand false positives, we manually coded the 37 invalid submissions and find that threat model misalignment explains 56.8 percent (21 of 37): reports that rely on assumptions about trust boundaries or scope that contradict the audit's rules. We detected no High or Medium findings in the V1 deployment; misses concentrated in specification details and implicit assumptions (57.1 percent), timing and concurrency issues (28.6 percent), and external library dependencies (14.3 percent). Our improved agent, evaluated against the ground truth of a competitive audit, achieved a strict recall of 27.3 percent on high-impact vulnerabilities, placing it in the top 4 percent of human auditors and outperforming 49 of 51 contestants on critical issues. These results, though from a single deployment, suggest that early, explicit threat modeling is essential for reducing false positives and focusing agentic auditing effort. The agent-driven process enables expert validation and submission in about 40 minutes on average.
Problem

Research questions and friction points this paper is trying to address.

multi-implementation systems
specification ambiguity
security auditing
false positives
differential testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Specification-to-Checklist
Agentic Auditing
Multi-Implementation Systems
Cross-Implementation Reuse
Threat Modeling