🤖 AI Summary
Existing smart contract security analysis tools suffer from high false positive rates, low accuracy, and poor usability, undermining developer trust and limiting real-world adoption. This study presents the first comprehensive evaluation to combine large-scale empirical assessment with developer surveys, systematically benchmarking six widely used tools (Confuzzius, Dlva, Mythril, Osiris, Oyente, and Slither) on 653 real-world contracts across three vulnerability classes from the OWASP Smart Contract Top Ten: reentrancy, suicidal contract termination, and integer arithmetic errors. The findings reveal substantial performance disparities, with F1 scores ranging from 31.2% to 94.6%, false positive rates of up to 32.6%, and per-contract analysis times exceeding 700 seconds. Feedback from 150 professional developers and auditors points to excessive false positives, hard-to-interpret results, and slow execution as the main barriers to trust. This work establishes a clear link between tool limitations and adoption barriers, offering empirical insights and actionable directions for future tool development.
📝 Abstract
Smart contracts underpin high-value ecosystems such as decentralized finance (DeFi), yet recurring vulnerabilities continue to cause losses worth billions of dollars. Although numerous security analyzers exist to detect such flaws, real-world attacks remain frequent, raising the question of whether these tools are truly effective or simply under-used due to low developer trust. Prior benchmarks have evaluated analyzers on synthetic or vulnerable-only contract datasets, limiting their ability to measure false positives, false negatives, and the usability factors that drive adoption. To close this gap, we present a mixed-methods study that combines large-scale benchmarking with practitioner insights. We evaluate six widely used analyzers (namely Confuzzius, Dlva, Mythril, Osiris, Oyente, and Slither) on 653 real-world smart contracts that cover three high-impact vulnerability classes from the OWASP Smart Contract Top Ten: reentrancy, suicidal contract termination, and integer arithmetic errors. Our results show substantial variation in accuracy (F1 = 31.2% to 94.6%), high false-positive rates (up to 32.6%), and runtimes exceeding 700 seconds per contract. We then survey 150 professional developers and auditors to understand how they use and perceive these tools. Our findings reveal that excessive false positives, vague explanations, and long analysis times are the main barriers to trust and adoption in practice. By linking measurable performance gaps to developer perceptions, we provide concrete recommendations for improving the precision, explainability, and usability of smart-contract security analyzers.
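To give a concrete sense of the first vulnerability class the study targets, the following is a minimal, hypothetical Python sketch (not from the paper, which concerns Solidity contracts) simulating the reentrancy pattern these analyzers try to flag: a contract releases funds before updating the caller's balance, so a malicious callback can re-enter `withdraw` and drain far more than it deposited. All names here (`VulnerableBank`, `Attacker`) are illustrative assumptions.

```python
# Illustrative simulation of reentrancy (hypothetical, not the paper's code).
# The bank "sends" funds before zeroing the balance, so a reentrant
# callback can withdraw the same balance repeatedly.

class VulnerableBank:
    def __init__(self):
        self.balances = {}
        self.vault = 100  # funds held for other users

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.vault += amount

    def withdraw(self, who, callback):
        amount = self.balances.get(who, 0)
        if amount > 0 and self.vault >= amount:
            self.vault -= amount   # funds sent first...
            callback()             # ...handing control back to the caller...
            self.balances[who] = 0 # ...and the balance is zeroed too late

class Attacker:
    def __init__(self, bank):
        self.bank = bank
        self.stolen = 0

    def callback(self):
        self.stolen += 10
        if self.bank.vault >= 10:  # re-enter while our balance is still 10
            self.bank.withdraw("attacker", self.callback)

bank = VulnerableBank()
bank.deposit("attacker", 10)
attacker = Attacker(bank)
bank.withdraw("attacker", attacker.callback)
print(attacker.stolen)  # far more than the 10 deposited
```

The fix, in both the simulation and real Solidity, is the checks-effects-interactions order: zero the balance before handing control to the caller. Reasoning about this control-flow handoff statically is exactly where the benchmarked tools diverge in precision.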