🤖 AI Summary
Existing decompiler evaluations rely heavily on synthetic benchmarks or subjective scoring, neglecting semantic fidelity and analyst usability in real-world reverse-engineering scenarios.
Method: We introduce DecompileBench, the first application-oriented, comprehensive evaluation benchmark for decompilation, comprising 23,400 functions from 130 real-world programs. It integrates three novel evaluation modalities: runtime trace matching for semantic validation, LLM-as-Judge human-centered assessment, and multi-dimensional automated metrics.
Contribution/Results: Our systematic evaluation of 12 state-of-the-art decompilers reveals, for the first time, that LLM-based decompilers significantly outperform commercial tools in code understandability (+28.6%), despite lagging by 52.2% in functional correctness. We publicly release DecompileBench (benchmarks, evaluation toolchain, and results) to support evidence-based tool selection in security analysis and to advance a human-centric paradigm for decompiler evaluation.
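The runtime trace matching idea can be understood as differential testing: execute the original function and the recompiled decompiled function on the same inputs and check that their observable behavior agrees. A minimal illustration in Python, using pure-Python stand-ins for the compiled functions (the harness and function names here are hypothetical, not DecompileBench's actual tooling):

```python
import random

def semantically_equivalent(original, recompiled, n_trials=1000, seed=0):
    """Differential test: both versions must produce identical outputs
    on the same randomly drawn inputs."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        x = rng.randint(-2**31, 2**31 - 1)
        if original(x) != recompiled(x):
            return False  # divergent behavior: decompilation lost semantics
    return True

# Stand-ins: a "ground truth" function and two decompiled candidates.
def original(x):
    return abs(x) % 7

def faithful(x):   # semantically identical rewrite
    return (x if x >= 0 else -x) % 7

def broken(x):     # subtle decompilation bug: sign handling dropped
    return x % 7

assert semantically_equivalent(original, faithful)
assert not semantically_equivalent(original, broken)
```

Random input sampling is only a sketch; the paper's runtime validation compares traces of real recompiled binaries, which additionally catches divergence in side effects, not just return values.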
📝 Abstract
Decompilers are fundamental tools for critical security tasks, from vulnerability discovery to malware analysis, yet their evaluation remains fragmented. Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings, failing to address real-world requirements for semantic fidelity and analyst usability. We present DecompileBench, the first comprehensive framework for effective evaluation of decompilers in reverse engineering workflows, built on three key components: *real-world function extraction* (comprising 23,400 functions from 130 real-world programs), *runtime-aware validation*, and *automated human-centric assessment* using LLM-as-Judge to quantify how effectively decompiler output serves analysts. Through a systematic comparison of six industrial-strength decompilers and six recent LLM-powered approaches, we demonstrate that LLM-based methods surpass commercial tools in code understandability despite 52.2% lower functional correctness. These findings highlight the potential of LLM-based approaches to transform human-centric reverse engineering. We open-source [DecompileBench](https://github.com/Jennieett/DecompileBench) to provide a framework that advances research on decompilers and assists security experts in making informed tool selections based on their specific requirements.
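The LLM-as-Judge assessment described above amounts to prompting a model with a fixed rubric and parsing structured scores from its reply. A hypothetical sketch of that plumbing (the rubric dimensions, prompt wording, and function names are illustrative assumptions, not the paper's actual prompt):

```python
import re

# Illustrative rubric; the real benchmark's judging criteria may differ.
RUBRIC = """You are a reverse-engineering expert. Rate the decompiled code
from 1 (unreadable) to 5 (as clear as source) on each dimension:
- readability: meaningful names, idiomatic control flow
- structure: loops and branches recovered rather than gotos
- fidelity: types and constants that aid analysis
Respond exactly as: readability=<n> structure=<n> fidelity=<n>"""

def build_judge_prompt(decompiled_code: str) -> str:
    """Combine the rubric with the code under evaluation."""
    return f"{RUBRIC}\n\nDecompiled function:\n{decompiled_code}"

def parse_scores(reply: str) -> dict:
    """Extract dimension=score pairs from the judge model's reply."""
    return {k: int(v) for k, v in re.findall(r"(\w+)=(\d)", reply)}

# Parsing a mocked judge reply (no model call made here):
scores = parse_scores("readability=4 structure=3 fidelity=5")
```

In practice the prompt would be sent to a judge model and scores averaged over many functions per decompiler; the parsing step shown here is the part that makes such ratings automatable and comparable across tools.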