🤖 AI Summary
This study addresses the lack of a unified, rigorous framework for evaluating the semantic consistency of smart contract decompilers, in particular the failure of existing evaluations to detect semantically incorrect yet superficially plausible Solidity code generated by large language models. To this end, the authors construct a benchmark dataset of 600 real-world contracts with paired bytecode, ground-truth source code, and replayable semantic checkpoints, and propose a four-stage cumulative evaluation framework covering format completeness, compilability, ABI recovery, and semantic consistency via differential replay. Zero-shot and compilation-repair experiments with frontier models (Claude Opus 4.7, GPT-5.3-Codex, and GLM-5) show that even the best-performing model perfectly decompiles only 42 of the 600 contracts. Same-model compilation repair substantially improves performance at modest additional cost, but semantic consistency remains a fundamental challenge in decompilation.
📝 Abstract
Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning, and in a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured and compilable Solidity, but semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42 of the 600 contracts. We further show that same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.