🤖 AI Summary
Binary function-level obfuscation detection remains challenging due to the semantic distortion introduced by diverse obfuscation techniques.
Method: This paper proposes a semantics-aware deep learning approach based on heterogeneous program graphs: it integrates control-flow graphs (CFGs) and data-flow graphs (DFGs) into a multi-source semantic graph and employs graph neural networks (GNNs) to capture cross-instruction and cross-path semantic features.
Contribution/Results: The paper presents the first systematic validation of GNNs’ effectiveness in learning obfuscation-sensitive semantic representations. It introduces a comprehensive benchmark dataset covering 11 fine-grained obfuscation types across multiple mainstream obfuscators. Experiments show that the method significantly outperforms conventional static and dynamic baselines on complex obfuscated binaries, achieving 92.7% accuracy in 11-class obfuscation classification. It also identifies obfuscated functions in real-world malware samples, which supports its generalizability and practical applicability.
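To make the method description above more concrete, the following is a minimal sketch of merging control-flow and data-flow edges into one typed graph and running a single round of relation-aware message passing over it. It is an illustration under stated assumptions, not the paper's implementation: the function names (`build_semantic_graph`, `message_pass`, `readout`), the toy three-instruction function, the two-dimensional node features, and the scalar per-relation weights are all hypothetical. A real GNN would learn weight matrices per edge type and stack several such layers.

```python
from collections import defaultdict

def build_semantic_graph(cfg_edges, dfg_edges):
    """Merge control-flow and data-flow edges into one typed adjacency list."""
    graph = defaultdict(list)
    for src, dst in cfg_edges:
        graph[src].append((dst, "cfg"))
    for src, dst in dfg_edges:
        graph[src].append((dst, "dfg"))
    return dict(graph)

def message_pass(graph, features, relation_weight):
    """One round of mean aggregation over incoming edges, weighted per edge type.

    relation_weight holds one scalar per edge type (an assumption made for
    brevity; learned matrices would be used in practice).
    """
    dim = len(next(iter(features.values())))
    # Group incoming messages by (target node, edge type).
    inbox = defaultdict(list)
    for src, edges in graph.items():
        for dst, kind in edges:
            inbox[(dst, kind)].append(features[src])
    updated = {}
    for node, own in features.items():
        new = list(own)  # keep the node's own feature (self-connection)
        for kind, weight in relation_weight.items():
            msgs = inbox.get((node, kind), [])
            if not msgs:
                continue
            for i in range(dim):
                mean_i = sum(m[i] for m in msgs) / len(msgs)
                new[i] += weight * mean_i
        updated[node] = new
    return updated

def readout(features):
    """Average node features into one function-level embedding."""
    dim = len(next(iter(features.values())))
    count = len(features)
    return [sum(f[i] for f in features.values()) / count for i in range(dim)]

# Hypothetical toy function with three instructions (nodes 0, 1, 2).
cfg_edges = [(0, 1), (1, 2)]   # execution order
dfg_edges = [(0, 2)]           # instruction 2 uses a value defined by 0
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 0.0]}

graph = build_semantic_graph(cfg_edges, dfg_edges)
updated = message_pass(graph, features, {"cfg": 1.0, "dfg": 0.5})
embedding = readout(updated)
```

In this toy case, node 2 receives information along both a control-flow edge and a data-flow edge, so its updated feature mixes the two relations. The resulting function-level `embedding` is what a downstream classifier (for example, one with 11 output classes) would consume.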
📝 Abstract
Protecting sensitive program content is a critical issue in various situations, ranging from legitimate use cases to unethical contexts. Obfuscation is one of the most widely used techniques to ensure such protection. Consequently, attackers must first detect and characterize obfuscation before launching any attack against it. This paper investigates the problem of function-level obfuscation detection using graph-based approaches, comparing algorithms, from elementary baselines to promising techniques such as graph neural networks (GNNs), across different feature choices. We consider various obfuscation types and obfuscators, resulting in two complex datasets. Our findings demonstrate that GNNs need meaningful features that capture aspects of function semantics to outperform baselines. Our approach shows satisfactory results, especially in a challenging 11-class classification task and in a practical malware analysis example.