Identifying Obfuscated Code through Graph-Based Semantic Analysis of Binary Code

📅 2025-04-02

📈 Citations: 0

✨ Influential: 0

career value

161K/year

🤖 AI Summary

Binary function-level obfuscation detection remains challenging due to the semantic distortion introduced by diverse obfuscation techniques. Method: This paper proposes a semantics-aware deep learning approach based on heterogeneous program graphs: it integrates control-flow graphs (CFGs) and data-flow graphs (DFGs) into a multi-source semantic graph and employs graph neural networks (GNNs) to capture cross-instruction and cross-path semantic features. Contribution/Results: It presents the first systematic validation of GNNs’ effectiveness in learning obfuscation-sensitive semantic representations. We introduce a comprehensive benchmark dataset covering 11 fine-grained obfuscation types across multiple mainstream obfuscators. Experiments demonstrate that our method significantly outperforms conventional static and dynamic baselines on complex obfuscated binaries, achieving 92.7% accuracy in 11-class obfuscation classification. Furthermore, it successfully identifies obfuscated functions in real-world malware samples, confirming strong generalizability and practical applicability.

Technology Category

Application Category

📝 Abstract

Protecting sensitive program content is a critical issue in various situations, ranging from legitimate use cases to unethical contexts. Obfuscation is one of the most used techniques to ensure such protection. Consequently, attackers must first detect and characterize obfuscation before launching any attack against it. This paper investigates the problem of function-level obfuscation detection using graph-based approaches, comparing algorithms, from elementary baselines to promising techniques like GNN (Graph Neural Networks), on different feature choices. We consider various obfuscation types and obfuscators, resulting in two complex datasets. Our findings demonstrate that GNNs need meaningful features that capture aspects of function semantics to outperform baselines. Our approach shows satisfactory results, especially in a challenging 11-class classification task and in a practical malware analysis example.

Problem

Research questions and friction points this paper is trying to address.

Detects function-level obfuscation using graph-based methods

Compares GNNs and baselines on semantic feature effectiveness

Evaluates performance on diverse obfuscation types and datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based semantic analysis for obfuscated code

Comparison of GNNs and baseline algorithms

Feature selection capturing function semantics

🔎 Similar Papers

FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs