AI Summary
Accurate disassembly of obfuscated binaries is hard because instruction boundaries are ambiguous, function entry points are difficult to identify reliably, and traditional heuristic approaches are not robust. This paper proposes Disa, a learning-based disassembler that applies multi-head self-attention to superset instructions and supports a value-set analysis built on a block-memory model. The method learns correlations among instructions to infer function entry points and instruction boundaries, and it identifies instructions relevant to memory block boundaries to enable more accurate control flow graph (CFG) generation. On obfuscated binaries, Disa improves the function entry-point F1-score over prior deep-learning approaches by 9.1% under disassembly desynchronization and by 13.2% under a popular source-level obfuscator. It also improves memory block precision by 18.5% and reduces Average Indirect Call Targets (AICT) by 4.4% compared with the state-of-the-art heuristic-based approach. These results indicate improved accuracy and robustness of static disassembly on obfuscated binaries.
Abstract
Disassembly is crucial yet challenging for security domains that rely on reverse engineering, such as vulnerability detection, malware analysis, and binary hardening. The fundamental challenge of disassembly is to identify instruction and function boundaries. Classic approaches rely on file-format assumptions and architecture-specific heuristics to guess the boundaries, which results in incomplete and incorrect disassembly, especially when the binary is obfuscated. Recent advances in disassembly have demonstrated that deep learning can improve both accuracy and efficiency. In this paper, we propose Disa, a new learning-based disassembly approach that applies multi-head self-attention to superset instructions to learn the correlations among instructions, and thereby infers function entry-points and instruction boundaries. Disa can further identify instructions relevant to memory block boundaries, which facilitates an advanced value-set analysis based on a block-memory model and yields more accurate control flow graph (CFG) generation. Our experiments show that Disa outperforms prior deep-learning disassembly approaches in function entry-point identification, achieving F1-score improvements of 9.1% and 13.2% on binaries obfuscated by the disassembly desynchronization technique and by a popular source-level obfuscator, respectively. By achieving an 18.5% improvement in memory block precision, Disa generates more accurate CFGs, with a 4.4% reduction in Average Indirect Call Targets (AICT) compared with the state-of-the-art heuristic-based approach.
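To make the notion of superset instructions concrete, the following is a minimal sketch of superset disassembly, in which a candidate instruction is decoded at every byte offset rather than only at offsets reached by a single sweep. It is an illustration under stated assumptions, not Disa's implementation: the three-opcode `TOY_ISA` and the helper names `decode_at`, `superset_disassemble`, and `linear_sweep` are hypothetical and stand in for a real variable-length instruction decoder.

```python
# Toy superset disassembly: decode a candidate instruction at EVERY byte offset.
# TOY_ISA is a hypothetical variable-length instruction set (opcode -> name, length);
# it is not x86 and is used only to keep the example self-contained.
TOY_ISA = {
    0x01: ("nop", 1),
    0x02: ("push_imm8", 2),
    0x03: ("jmp_rel16", 3),
}


def decode_at(code, offset):
    """Return (mnemonic, length) if a complete instruction decodes at offset, else None."""
    entry = TOY_ISA.get(code[offset])
    if entry is None:
        return None  # unknown opcode
    mnemonic, length = entry
    if offset + length > len(code):
        return None  # instruction would run past the end of the buffer
    return mnemonic, length


def superset_disassemble(code):
    """Map every byte offset to its candidate instruction (or None)."""
    return {offset: decode_at(code, offset) for offset in range(len(code))}


def linear_sweep(code, start=0):
    """Follow one instruction stream from start; return the offsets it visits."""
    visited = []
    offset = start
    while offset < len(code):
        candidate = decode_at(code, offset)
        if candidate is None:
            break
        visited.append(offset)
        offset += candidate[1]
    return visited


if __name__ == "__main__":
    code = bytes([0x02, 0x01, 0x03, 0x01, 0x02])
    for offset, candidate in superset_disassemble(code).items():
        print(offset, candidate)
    print("sweep from 0:", linear_sweep(code, 0))
    print("sweep from 1:", linear_sweep(code, 1))
```

In this toy buffer, sweeps started at offsets 0 and 1 follow different instruction streams before converging, which is the ambiguity that desynchronization-style obfuscation exploits. A superset representation keeps all candidates, so a learned model can decide which offsets are true instruction boundaries.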