Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing binary decompilation research is hindered by the lack of a large-scale, high-fidelity, real-world binary–source function-pair benchmark. Method: We introduce Decompile-Bench, the first million-scale open-source benchmark, condensing 100 million collected function pairs (450GB of binaries compiled from permissively licensed GitHub projects) into 2 million high-quality pairs spanning diverse compilers, optimization levels, and inlining scenarios. We also propose Decompile-Bench-Eval, a leakage-resistant evaluation benchmark built from manually crafted binaries for the HumanEval and MBPP suites alongside GitHub repositories released after 2025. Contribution/Results: Fine-tuning LLM-based decompilers on Decompile-Bench improves the re-executability rate by 20% over previous benchmarks. The dataset and evaluation framework are publicly released (Hugging Face/GitHub) to enable standardized, reproducible assessment.

📝 Abstract
Recent advances in LLM-based decompilers have shown that low-level binaries can be effectively converted into human-readable source code. However, the field still lacks a comprehensive benchmark providing large-scale binary-source function pairs, which is critical for advancing LLM decompilation technology. Creating accurate binary-source mappings is difficult because complex compilation settings and widespread function inlining obscure the correspondence between binaries and their original source code. Previous efforts have relied on contest-style benchmarks, on synthetic binary-source mappings that diverge significantly from real-world mappings, or on partially matched binaries with only code lines or variable names, compromising the effectiveness of analyzing binary functionality. To alleviate these issues, we introduce Decompile-Bench, the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects. For evaluation purposes, we also developed Decompile-Bench-Eval, a benchmark including manually crafted binaries from the well-established HumanEval and MBPP suites, alongside compiled GitHub repositories released after 2025 to mitigate data leakage. We further explore commonly used evaluation metrics to provide a thorough assessment of the studied LLM decompilers and find that fine-tuning with Decompile-Bench yields a 20% improvement over previous benchmarks in terms of re-executability rate. Our code and data have been released on Hugging Face and GitHub: https://github.com/albertan017/LLM4Decompile
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale binary-source function pairs for decompilation benchmarks
Inaccurate binary-source mappings due to complex compilation and function inlining
Previous benchmarks use synthetic or partial data, limiting real-world effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale binary-source function pairs dataset
Open-source dataset with 2 million pairs
Fine-tuning improves re-executability by 20%
Hanzhuo Tan
The Hong Kong Polytechnic University
Code Generation · Decompilation · Binary Analysis
Xiaolong Tian
Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology
Hanrui Qi
Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology
Jiaming Liu
Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology
Zuchen Gao
PhD Candidate, The Hong Kong Polytechnic University
Siyi Wang
Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology
Qi Luo
Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology
Jing Li
Department of Computing, The Hong Kong Polytechnic University
Yuqun Zhang
Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology