🤖 AI Summary
Virtualization-obfuscated binaries are typically large and structurally complex. They exceed the input length limits of large language models (LLMs), and annotated data for them is scarce; both problems hinder direct LLM-based code analysis. To address this, the work proposes a structural-role-oriented decomposition and automatic labeling approach: static analysis partitions the obfuscated code into the largest semantically coherent units that fit within LLM input constraints, and each unit is automatically labeled by its structural role in the control flow graph. This enables scalable dataset construction for both training and inference. In an evaluation on real-world virtualization-obfuscated binaries, the prototype is reported to perform efficient and accurate analysis, addressing the two bottlenecks of input length limits and data scarcity that currently impede LLM-based approaches in this domain.
📝 Abstract
Virtualization-based obfuscation produces extremely large and structurally complex binaries, posing challenges for LLM-based analysis due to input size limits and the need for large-scale labeled data. We address this by focusing on structural rather than full semantic analysis. Obfuscated binaries are decomposed into the largest semantically coherent units that fit within LLM constraints and are labeled according to their structural roles. We implement a static analysis framework to automate labeling and enable large-scale dataset generation. Our prototype shows strong performance on real-world virtualization obfuscators.
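To make the decomposition idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy control flow graph given as successor lists, a per-block token count, and a hypothetical role scheme (`entry`, `dispatcher`, `handler`, `exit`) derived only from out-degree. The function names `label_role` and `decompose`, the role names, and the greedy packing strategy are all illustrative assumptions; the abstract does not specify them.

```python
def label_role(block, succs, entry):
    """Assign a hypothetical structural role from CFG shape alone."""
    if block == entry:
        return "entry"
    out_degree = len(succs.get(block, []))
    if out_degree == 0:
        return "exit"
    if out_degree > 2:
        # Many outgoing edges: treated here as a VM-style dispatcher (assumption).
        return "dispatcher"
    return "handler"


def decompose(order, tokens, succs, entry, budget):
    """Greedily pack consecutive same-role blocks into units within a token budget.

    A single block larger than the budget still forms its own unit;
    a real system would need to split or truncate it.
    """
    units = []
    current, role, used = [], None, 0
    for block in order:
        block_role = label_role(block, succs, entry)
        # Start a new unit when the role changes or the budget would overflow.
        if current and (block_role != role or used + tokens[block] > budget):
            units.append((role, current))
            current, used = [], 0
        current.append(block)
        role = block_role
        used += tokens[block]
    if current:
        units.append((role, current))
    return units


# Toy CFG: an entry block, one dispatcher, three handlers, and a return block.
succs = {
    "entry": ["dispatch"],
    "dispatch": ["h1", "h2", "h3"],
    "h1": ["dispatch"],
    "h2": ["dispatch"],
    "h3": ["ret"],
    "ret": [],
}
tokens = {"entry": 40, "dispatch": 30, "h1": 50, "h2": 60, "h3": 70, "ret": 10}
order = ["entry", "dispatch", "h1", "h2", "h3", "ret"]

units = decompose(order, tokens, succs, entry="entry", budget=120)
for role, blocks in units:
    print(role, blocks)
```

In this sketch each unit carries exactly one label, so the (unit, label) pairs could serve directly as training examples. The greedy packing only approximates "largest coherent unit"; it does not attempt the semantic coherence analysis the abstract describes.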