Towards LLM-Based Analysis of Virtualization-Obfuscated Code through Automated Data Generation

📅 2026-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Virtualization-obfuscated binaries are typically large and structurally complex. They exceed the input length limits of large language models (LLMs), and annotated data for them is scarce; both problems hinder applying LLMs directly to their analysis. To address this, the work proposes decomposing the code and labeling it by structural role: static analysis partitions the obfuscated code into the largest semantically coherent units that fit within LLM input constraints, and each unit is annotated automatically according to its structural role within the control flow graph. This enables the construction of a scalable dataset for both training and inference. In an evaluation on real-world virtualization obfuscators, the prototype shows strong performance, addressing both the input-length and data-scarcity bottlenecks that currently impede LLM-based approaches in this domain.
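The decomposition step described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the `Block` representation, the `approx_tokens` whitespace tokenizer, and the use of address-order contiguity as a stand-in for "semantic coherence" are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (start address, list of instruction strings).
Block = Tuple[int, List[str]]

def approx_tokens(block: Block) -> int:
    # Crude stand-in for a real LLM tokenizer: count whitespace-separated pieces.
    return sum(len(ins.split()) for ins in block[1])

def pack_blocks(blocks: List[Block], budget: int) -> List[List[Block]]:
    """Greedily group address-ordered blocks into units that stay within budget.

    A block that alone exceeds the budget still gets its own unit; a real
    system would need to split or reject it.
    """
    units: List[List[Block]] = []
    current: List[Block] = []
    used = 0
    for block in sorted(blocks, key=lambda b: b[0]):
        cost = approx_tokens(block)
        # Close the current unit when adding this block would overflow it.
        if current and used + cost > budget:
            units.append(current)
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        units.append(current)
    return units

# Made-up example blocks, not taken from any real binary.
blocks = [
    (0x1000, ["mov eax, [esi]", "add esi, 4"]),
    (0x1010, ["jmp [table + eax*4]"]),
    (0x1020, ["pop ebx", "push ebx"]),
]
units = pack_blocks(blocks, budget=10)
```

With this budget the first two blocks share a unit and the third starts a new one. Greedy packing keeps each unit as large as the budget allows, which mirrors the stated goal of maximally sized units, though a faithful implementation would cut along control-flow boundaries rather than raw address order.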
📝 Abstract
Virtualization-based obfuscation produces extremely large and structurally complex binaries, posing challenges for LLM-based analysis due to input size limits and the need for large-scale labeled data. We address this by focusing on structural rather than full semantic analysis. Obfuscated binaries are decomposed into the largest semantically coherent units that fit within LLM constraints and are labeled according to their structural roles. We implement a static analysis framework to automate labeling and enable large-scale dataset generation. Our prototype shows strong performance on real-world virtualization obfuscators.
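As a rough illustration of labeling by structural role, the sketch below assigns roles to nodes of a toy control flow graph. The role names (`dispatcher`, `handler`, `entry`, `other`), the heuristic of treating the node with the most successors as the dispatcher, and the `label_roles` function are all hypothetical; the source does not specify the label set or the labeling rules.

```python
from typing import Dict, List

def label_roles(cfg: Dict[str, List[str]], entry: str) -> Dict[str, str]:
    """Assign a structural role to every node of a CFG given as adjacency lists."""
    # Collect all nodes, including ones that only appear as edge targets.
    nodes = set(cfg)
    for dsts in cfg.values():
        nodes.update(dsts)

    # Assumed heuristic: the dispatcher is the node with the most successors.
    dispatcher = max(cfg, key=lambda n: len(cfg[n]))

    roles: Dict[str, str] = {}
    for node in nodes:
        succs = cfg.get(node, [])
        if node == dispatcher:
            roles[node] = "dispatcher"
        elif node in cfg[dispatcher] and dispatcher in succs:
            # Reached from the dispatcher and jumps back to it.
            roles[node] = "handler"
        elif node == entry:
            roles[node] = "entry"
        else:
            roles[node] = "other"
    return roles

# Toy CFG shaped like a fetch-dispatch loop; node names are made up.
cfg = {
    "vm_entry": ["dispatch"],
    "dispatch": ["h_add", "h_push", "vm_exit"],
    "h_add": ["dispatch"],
    "h_push": ["dispatch"],
    "vm_exit": [],
}
roles = label_roles(cfg, entry="vm_entry")
```

Because the labels come purely from graph shape, no manual annotation is needed, which is what makes this kind of labeling suitable for generating datasets at scale.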
Problem

Research questions and friction points this paper is trying to address.

virtualization-based obfuscation
LLM-based analysis
large-scale labeled data
input size limits
structurally complex binaries
Innovation

Methods, ideas, or system contributions that make the work stand out.

virtualization-based obfuscation
LLM-based analysis
structural decomposition
automated labeling
large-scale dataset generation