🤖 AI Summary
Existing decompilers often produce code that is syntactically incorrect, uncompilable, or behaviorally inaccurate, which limits its practical utility. This work proposes MCGD, a multi-agent collaborative framework that introduces, for the first time, a three-tiered constraint system enforcing syntactic correctness, compilability, and behavioral equivalence. By integrating execution-based validation with an LLM-driven iterative repair mechanism, MCGD recovers high-fidelity, re-executable source code. The approach uses GPT-4o to build specialized repair agents and feeds hierarchical validation errors back to them to guide each repair. Evaluated on 1,641 real-world binaries from ExeBench, MCGD attains a re-executability rate of 84–97%, improving on baseline decompiler output by 28–89 percentage points and surpassing existing LLM-based decompilation techniques that use the same GPT-4o backbone (84.1% versus 61.8–80.3%). Over 90% of samples converge within two iterations, at an average cost of only $0.03–0.05 per sample.
📝 Abstract
Decompilation (recovering source code from compiled binaries) is essential for security analysis, malware reverse engineering, and legacy software maintenance. However, existing decompilers produce code that often fails to compile or execute correctly, limiting their practical utility. We present a multi-agent framework that transforms decompiled code into re-executable source through Multi-level Constraint-Guided Decompilation (MCGD). Our approach employs a hierarchical validation pipeline with three constraint levels: (1) syntactic correctness via parsing, (2) compilability via GCC, and (3) behavioral equivalence via LLM-generated test cases. When validation fails, specialized LLM agents iteratively refine the code using structured error feedback. We evaluate our framework on 1,641 real-world binaries from ExeBench across three decompilers (RetDec, Ghidra, and Angr). Our framework achieves 84–97% re-executability, improving baseline decompiler output by 28–89 percentage points. In a comparison with state-of-the-art LLM-based decompilation methods using the same GPT-4o backbone, our approach (84.1%) outperforms LLM4Decompile (80.3%), SK2Decompile (73.9%), and SALT4Decompile (61.8%). Our ablation study reveals that execution-based validation is critical: compile-only approaches achieve 0% behavioral correctness despite 91–99% compilation rates. The system converges efficiently, with more than 90% of binaries reaching correctness within 2 iterations at an average cost of $0.03–0.05 per binary. Our results demonstrate that constraint-guided agentic refinement can bridge the gap between raw decompiler output and practically useful source code.
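The hierarchical validate-then-repair loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Checker`, `Repairer`, `Result`, `first_failure`, and `refine`, as well as the `max_iters` default, are hypothetical and chosen only for illustration. Each constraint level is modeled as a callable that returns `None` on success or an error message on failure; in a real system these would presumably wrap a parser, a GCC invocation, and a test harness, and the repair callable would wrap an LLM agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a checker returns None on success or an error message on failure.
Checker = Callable[[str], Optional[str]]
# A repairer receives (code, failed level name, error message) and returns revised code.
Repairer = Callable[[str, str, str], str]


@dataclass
class Result:
    code: str                    # final version of the source code
    passed: bool                 # True if every constraint level succeeded
    iterations: int              # number of repair calls performed
    failed_level: Optional[str]  # first level still failing, if any


def first_failure(
    code: str, levels: List[Tuple[str, Checker]]
) -> Optional[Tuple[str, str]]:
    """Run the levels in order and return (level, error) for the first failure."""
    for name, check in levels:
        error = check(code)
        if error is not None:
            return name, error
    return None


def refine(
    code: str,
    levels: List[Tuple[str, Checker]],
    repair: Repairer,
    max_iters: int = 5,
) -> Result:
    """Repair the code until all levels pass or the iteration budget is spent."""
    iterations = 0
    failure = first_failure(code, levels)
    while failure is not None and iterations < max_iters:
        level, error = failure
        # Structured feedback: the repairer sees which level failed and why.
        code = repair(code, level, error)
        iterations += 1
        failure = first_failure(code, levels)
    failed_level = failure[0] if failure is not None else None
    return Result(code, failure is None, iterations, failed_level)
```

Because the levels are checked in order, a later level such as behavioral equivalence is only evaluated once the cheaper syntactic and compilation checks pass, which mirrors the three-level ordering in the abstract. Passing the checkers and the repairer as arguments keeps the sketch runnable with simple stand-ins in place of a compiler or a model.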