🤖 AI Summary
Existing neural decompilers struggle with user-defined composite types (e.g., structs and classes), leading to semantic loss, poor readability, and heavy reliance on test cases. Method: We propose the first joint code-and-type prediction paradigm, introduce a cross-procedural control- and data-flow-aware contextual modeling mechanism, and construct Realtype, the first benchmark dataset featuring real-world composite types. Building on a large-language-model fine-tuning framework, we design a multi-task joint decoding architecture. Contribution/Results: Our approach achieves state-of-the-art performance on Realtype. We open-source the lightweight and efficient Idioms model series, which significantly improves the semantic completeness and readability of decompiled output. This work advances both open scientific research and industrial-grade reverse engineering.
📄 Abstract
Decompilers are important tools for reverse engineers that help them analyze software at a higher level of abstraction than assembly. Unfortunately, because compilation is lossy, deterministic decompilers produce code that is missing many of the details that make source code readable in the first place, like variable names and types. Neural decompilers, on the other hand, offer the ability to statistically fill in these details. Existing work in neural decompilation, however, suffers from substantial drawbacks that limit its ability to handle real code: it either cannot handle user-defined composite types, which are essential to fully specifying many functions' semantics, or it requires test cases. In this work, we introduce a new training process to finetune any LLM into a neural decompiler capable of generating the appropriate user-defined types alongside the decompilation. We introduce a new dataset, Realtype, that includes substantially more complicated and realistic types than existing neural decompilation benchmarks. Motivated by the intuition that different parts of a data structure may be operated upon by different parts of the program, we show that interprocedural context can help improve neural decompilers' ability to handle user-defined types. We show that our training process yields state-of-the-art results in neural decompilation. We also publicly release the Idioms series of finetuned neural decompilation models in support of open science. In summary, we identify the need for joint code and type prediction, show that it is a hard problem, and take the first steps toward solving it.