Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing binary code similarity detection methods struggle to balance interpretability, generalizability, and scalability: handcrafted features are human-readable but exhibit poor generalization, while embedding-based approaches are robust yet opaque, difficult to verify, and suffer from low retrieval efficiency. This paper introduces the first language-model-based agent framework for structured reasoning over assembly code, enabling semantic parsing that automatically generates interpretable and verifiable structured features—including input/output types, side effects, and critical constants. By integrating inverted indexing with relational indexing, the method achieves efficient and scalable retrieval. Crucially, it requires no training and attains recall@1 of 42% (cross-architecture) and 62% (cross-optimization), matching state-of-the-art supervised embedding methods. When fused with embeddings, it significantly surpasses prior art.

📝 Abstract
Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identifying semantically similar code across different contexts. Modern methods have progressed from manually engineered features to vector representations. Hand-crafted statistics (e.g., operation ratios) are interpretable but shallow, and fail to generalize. Embedding-based methods overcome this by learning robust cross-setting representations, but these representations are opaque vectors that prevent rapid verification. They also face a scalability-accuracy trade-off, since high-dimensional nearest-neighbor search requires approximations that reduce precision. Current approaches thus force a compromise between interpretability, generalizability, and scalability. We bridge these gaps using a language-model-based agent to conduct structured reasoning over assembly code and generate features such as input/output types, side effects, notable constants, and algorithmic intent. Unlike hand-crafted features, they are richer and adaptive. Unlike embeddings, they are human-readable, maintainable, and directly searchable with inverted or relational indexes. Without any matching training, our method achieves recall@1 of 42% and 62% on cross-architecture and cross-optimization tasks, respectively, comparable to embedding methods that require training (39% and 34%). Combined with embeddings, it significantly outperforms the state of the art, demonstrating that accuracy, scalability, and interpretability can coexist.
Problem

Research questions and friction points this paper is trying to address.

Bridge gaps between interpretability, generalizability, and scalability in binary code similarity
Generate human-readable features from assembly code using structured reasoning
Overcome limitations of opaque embeddings and shallow hand-crafted statistics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language model agent generates structured assembly code features
Features are human-readable and directly searchable via indexes
Combines interpretable features with embeddings to enhance performance
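The retrieval idea above can be illustrated with a toy inverted index over structured features. This is a minimal sketch under assumptions: the feature schema (`io_types`, `side_effects`, `constants`) and the function names are hypothetical stand-ins for whatever the agent actually emits, and the paper's relational indexing and embedding fusion are not modeled here.

```python
from collections import defaultdict

# Hypothetical structured features of the kind the paper describes
# (input/output types, side effects, notable constants). The schema
# and entries below are illustrative assumptions, not the paper's output.
functions = {
    "md5_update": {"io_types": ["ptr", "u32"],
                   "side_effects": ["writes_state"],
                   "constants": ["0x67452301"]},
    "crc32":      {"io_types": ["ptr", "u32"],
                   "side_effects": [],
                   "constants": ["0xEDB88320"]},
    "strlen_impl": {"io_types": ["ptr"],
                    "side_effects": [],
                    "constants": []},
}

# Build the inverted index: each (field, value) token maps to the set of
# functions exhibiting it. Lookup is then exact-match and scales without
# approximate nearest-neighbor search.
index = defaultdict(set)
for name, feats in functions.items():
    for field, values in feats.items():
        for v in values:
            index[(field, v)].add(name)

def query(feats):
    """Rank indexed functions by the number of shared feature tokens."""
    scores = defaultdict(int)
    for field, values in feats.items():
        for v in values:
            for name in index.get((field, v), ()):
                scores[name] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Query features extracted from an unknown binary that contains the
# MD5 initialization constant: the MD5 routine ranks first.
print(query({"constants": ["0x67452301"], "io_types": ["ptr", "u32"]}))
```

Because every feature token is a discrete, human-readable key, an analyst can inspect exactly why a match ranked first, which is the interpretability property the opaque-embedding baseline lacks.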