AI Summary
This work addresses the lack of a realistic and general-purpose benchmark for binary function similarity detection (BFSD), which has hindered effective evaluation of model generalization. To bridge this gap, we introduce EXHIB, the first comprehensive benchmark encompassing five real-world datasets that span critical dimensions including compiler optimizations, architectural disparities, code obfuscation, and high-level semantic variations. Using EXHIB, we systematically evaluate nine state-of-the-art BFSD models and observe performance degradations of up to 30% in firmware and semantically complex scenarios. These findings reveal a significant deficiency in the robustness of current methods against high-level semantic differences and expose a critical flaw in existing evaluation protocols. EXHIB thus establishes a more realistic, diverse, and reproducible foundation for future BFSD research.
Abstract
Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. Over the past few decades, numerous models and tools have been developed for these tasks; however, the lack of a comprehensive, general-purpose benchmark in this field has made it difficult for researchers to compare models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or binary types, and fail to reflect the full diversity of real-world applications.
We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate nine representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on the firmware and semantic datasets compared to standard settings, revealing substantial generalization gaps. Our results show that robustness to low- and mid-level binary variations does not generalize to high-level semantic differences, underscoring a critical blind spot in current BFSD evaluation practices.
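BFSD models are typically compared with retrieval-style metrics: each query function's embedding is matched against a pool of candidate functions, and the model scores well if the true counterpart ranks first. The sketch below is illustrative only (it is not EXHIB's actual protocol, and the embeddings and names are invented for the example); it shows a recall@1 computation over cosine similarity, the kind of measurement under which the degradations above would be observed.

```python
import numpy as np

def recall_at_1(query_embs: np.ndarray, pool_embs: np.ndarray,
                ground_truth: np.ndarray) -> float:
    """Fraction of queries whose nearest pool function (by cosine
    similarity) is the ground-truth match."""
    # L2-normalize rows so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T                      # shape: (num_queries, pool_size)
    top1 = sims.argmax(axis=1)          # index of the most similar pool entry
    return float((top1 == ground_truth).mean())

# Toy data: a pool of 4 function embeddings; each query is a slightly
# perturbed copy of its ground-truth match (simulating, e.g., a
# recompiled variant of the same source function).
rng = np.random.default_rng(0)
pool = rng.normal(size=(4, 8))
truth = np.array([2, 0, 3])
queries = pool[truth] + 0.05 * rng.normal(size=(3, 8))
print(recall_at_1(queries, pool, truth))
```

With such a small perturbation the nearest neighbor is the true match, so recall@1 is 1.0 here; a generalization gap of the kind reported above shows up as this number dropping sharply when the query/pool pairs come from a harder dataset.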