SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LLM-based binary decompilers struggle to recover source-level program structure together with meaningful identifiers. This paper proposes SK2Decompile, a decompiler built on a two-phase "skeleton to skin" paradigm: (1) a structure recovery phase, in which a dedicated model translates binary code into an intermediate representation (IR) that preserves control flow and data structures while replacing all identifiers with generic placeholders (the *semantic skeleton*), trained with reinforcement learning that rewards output conforming to the syntactic and semantic rules compilers expect; and (2) an identifier naming phase, in which a separate model generates human-readable identifiers that reflect the program's semantics (the *semantic skin*), trained with a reinforcement learning objective that rewards semantic similarity to the reference code. Decoupling the two phases lets structural correctness and readability be improved independently. In the reported experiments, SK2Decompile achieves a 21.6% average gain in re-executability rate over GPT-5-mini on HumanEval and a 29.4% average R2I improvement over Idioms on the GitHub2025 benchmark.

📝 Abstract

Large Language Models (LLMs) have emerged as a promising approach for binary decompilation. However, existing LLM-based decompilers remain limited in their ability to present a program's source-level structure together with its original identifiers. To mitigate this, we introduce SK2Decompile, a novel two-phase approach that decompiles from the skeleton (semantic structure) to the skin (identifiers) of a program. Specifically, we first apply a Structure Recovery model to translate a program's binary code into an Intermediate Representation (IR) that serves as the program's "skeleton": it preserves control flow and data structures while obfuscating all identifiers with generic placeholders. We also apply reinforcement learning to reward the model for producing program structures that adhere to the syntactic and semantic rules expected by compilers. Second, we apply an Identifier Naming model to produce meaningful identifiers that reflect actual program semantics, yielding the program's "skin". We train the Identifier Naming model with a separate reinforcement learning objective that rewards the semantic similarity between its predictions and the reference code. This two-phase process allows the correctness and the readability of decompilation to be advanced independently. Our evaluations indicate that SK2Decompile significantly outperforms the SOTA baselines, achieving a 21.6% average re-executability rate gain over GPT-5-mini on the HumanEval dataset and a 29.4% average R2I improvement over Idioms on the GitHub2025 benchmark.
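To make the skeleton/skin split concrete, the following is a minimal sketch of the data flow only, not the paper's implementation. It assumes the skeleton can be pictured as C-like source whose identifiers are replaced by placeholders; the placeholder scheme (`id0`, `id1`, ...), the keyword list, and the function names `to_skeleton` and `apply_skin` are all hypothetical. In SK2Decompile both phases are performed by trained models; here a regex pass stands in for phase one, and a lookup table stands in for the names the Identifier Naming model would predict.

```python
import re

# Illustrative subset of C keywords/types that must not be renamed (assumption).
C_KEYWORDS = {
    "int", "char", "void", "long", "short", "float", "double", "unsigned",
    "signed", "struct", "return", "if", "else", "for", "while", "do",
    "break", "continue", "switch", "case", "default", "sizeof", "const",
}

# \b keeps the pattern from matching the tail of a literal such as 0x10.
IDENT = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*")


def to_skeleton(source: str) -> tuple[str, dict[str, str]]:
    """Replace every non-keyword identifier with a generic placeholder.

    Returns the placeholder-only text and the original-name -> placeholder map.
    """
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"id{len(mapping)}"  # hypothetical placeholder scheme
        return mapping[name]

    return IDENT.sub(repl, source), mapping


def apply_skin(skeleton: str, names: dict[str, str]) -> str:
    """Substitute a name for each placeholder (the 'skin' step)."""
    return IDENT.sub(lambda m: names.get(m.group(0), m.group(0)), skeleton)


source = (
    "int sum_array(int *values, int count) { int total = 0; "
    "for (int i = 0; i < count; i++) total += values[i]; return total; }"
)
skeleton, mapping = to_skeleton(source)

# Stand-in for the naming model's output: here we simply invert the mapping.
predicted_names = {placeholder: name for name, placeholder in mapping.items()}
restored = apply_skin(skeleton, predicted_names)
```

Because structure and names are kept in separate artifacts (`skeleton` and `predicted_names`), each can be checked or improved without touching the other, which is the property the two-phase design relies on.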

Problem

Research questions and friction points this paper is trying to address.

Recovering program structure from binary code

Generating meaningful identifiers for decompilation

Improving decompilation correctness and readability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase decompilation from skeleton to skin

Structure recovery with reinforcement learning

Identifier naming with semantic similarity rewards
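As an illustration of what a similarity-based naming reward could look like, here is a toy sketch. It is an assumption-laden stand-in: this page does not specify how the paper measures semantic similarity, so the sketch uses lexical subtoken overlap (Jaccard similarity) instead, and the names `subtokens`, `name_similarity`, and `naming_reward` are hypothetical.

```python
import re


def subtokens(name: str) -> set[str]:
    """Split snake_case / camelCase identifiers into lowercase subtokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return {part.lower() for part in parts}


def name_similarity(predicted: str, reference: str) -> float:
    """Jaccard overlap of subtokens; a lexical proxy, not a semantic measure."""
    a, b = subtokens(predicted), subtokens(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def naming_reward(predicted: dict[str, str], reference: dict[str, str]) -> float:
    """Average similarity over all placeholders present in the reference.

    A placeholder the model failed to name scores 0 for that entry.
    """
    if not reference:
        return 0.0
    scores = [
        name_similarity(predicted.get(placeholder, ""), ref_name)
        for placeholder, ref_name in reference.items()
    ]
    return sum(scores) / len(scores)


reference = {"id0": "sum_array", "id1": "values", "id2": "count"}
predicted = {"id0": "arraySum", "id1": "data", "id2": "count"}
reward = naming_reward(predicted, reference)
```

In this toy setup `arraySum` earns full credit against `sum_array` because the subtokens match regardless of order or casing style, while `data` earns none against `values` even though the two are arguably close in meaning. That gap is one reason a real reward would more plausibly rely on a semantic measure rather than a purely lexical one.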

🔎 Similar Papers

Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning