BinPool: A Dataset of Vulnerabilities for Binary Security Analysis

📅 2025-04-27

🤖 AI Summary

Current binary vulnerability detection is hindered by the scarcity of public datasets, their semantic homogeneity, and the absence of authentic vulnerable–patched binary pairs. To address these bottlenecks, we propose BinPool, the first large-scale, open-source dataset of real-world binary vulnerabilities. It comprises 6,144 binaries from 162 Debian packages, covering 603 CVEs and 89 CWE types, with every program compiled at four optimization levels (-O0 to -O3). An automated pipeline crawls Debian's historical package versions and performs differential analysis, which scales the collection of authentic vulnerable–patched binary pairs. Reproducible compilation and CVE–CWE mapping validation are integrated to avoid manually injected vulnerabilities, labeling errors, and semantic distortion. BinPool is intended to support benchmarking across multiple tasks, including vulnerability detection, function-level similarity analysis, and binary plagiarism detection.
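One way to picture the version-history step is as a lookup over a package's release list. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: it assumes the release history is already sorted oldest to newest and that the release carrying the fix is known. The function name `select_pair` and the version strings are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def select_pair(
    versions: Sequence[str], fixed_version: str
) -> Optional[Tuple[str, str]]:
    """Return (last vulnerable version, first patched version), or None.

    Assumption: `versions` is sorted oldest to newest, and `fixed_version`
    is the release that first carries the fix for a given CVE.
    """
    if fixed_version not in versions:
        return None
    idx = versions.index(fixed_version)
    if idx == 0:
        # No earlier release exists to serve as the vulnerable side.
        return None
    return versions[idx - 1], fixed_version


# Made-up version strings, for illustration only.
history = ["1.2-1", "1.2-2", "1.2-2+deb11u1", "1.3-1"]
pair = select_pair(history, "1.2-2+deb11u1")
print(pair)
```

Taking the release immediately before the fix keeps the difference between the two builds small, so most of what changes between the binaries should be the patch itself.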

📝 Abstract

The development of machine learning techniques for discovering software vulnerabilities relies fundamentally on the availability of appropriate datasets. The ideal dataset consists of a large and diverse collection of real-world vulnerabilities, paired so as to contain both vulnerable and patched versions of each program. Naturally, collecting such datasets is a laborious and time-consuming task. Within the specific domain of vulnerability discovery in binary code, previous datasets are either publicly unavailable, lack semantic diversity, involve artificially introduced vulnerabilities, or were collected using static analyzers and therefore contain incorrectly labeled example programs. In this paper, we describe a new publicly available dataset, which we call BinPool, containing numerous samples of vulnerable versions of Debian packages across the years. The dataset was automatically curated and contains both vulnerable and patched versions of each program, compiled at four different optimization levels. Overall, the dataset covers 603 distinct CVEs across 89 CWE classes and 162 Debian packages, and contains 6,144 binaries. We argue that this dataset is suitable for evaluating a range of security analysis tools, including tools for vulnerability discovery, binary function similarity, and plagiarism detection.

Problem

Research questions and friction points this paper is trying to address.

Lack of diverse real-world vulnerability datasets for binary analysis

Existing datasets have incorrect labels or artificial vulnerabilities

Need for public dataset with vulnerable and patched binary versions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically curated dataset of vulnerabilities

Includes vulnerable and patched program versions

Compiled at four optimization levels
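The combination of vulnerable/patched versions and four optimization levels can be sketched as a small record model. This is a hypothetical illustration only: the `BinaryRecord` fields, the helper names, and the placeholder package, CVE, and CWE values are assumptions and do not describe the dataset's actual layout.

```python
from dataclasses import dataclass
from itertools import product

OPT_LEVELS = ("O0", "O1", "O2", "O3")
STATES = ("vulnerable", "patched")


@dataclass(frozen=True)
class BinaryRecord:
    """Hypothetical metadata for one compiled binary."""
    package: str
    cve: str
    cwe: str
    opt_level: str
    state: str


def expand(package: str, cve: str, cwe: str) -> list:
    """List every (optimization level, state) build for one CVE."""
    return [
        BinaryRecord(package, cve, cwe, opt, state)
        for opt, state in product(OPT_LEVELS, STATES)
    ]


def pairs_by_opt(records: list) -> dict:
    """Group records into (vulnerable, patched) pairs per optimization level."""
    grouped = {}
    for rec in records:
        key = (rec.package, rec.cve, rec.opt_level)
        grouped.setdefault(key, {})[rec.state] = rec
    # Keep only keys for which both sides of the pair are present.
    return {
        key: (sides["vulnerable"], sides["patched"])
        for key, sides in grouped.items()
        if set(sides) == set(STATES)
    }


# Placeholder identifiers, not real dataset entries.
records = expand("examplepkg", "CVE-0000-0000", "CWE-787")
pairs = pairs_by_opt(records)
print(len(records), len(pairs))
```

Keying pairs on the optimization level means a tool is always compared on a vulnerable and a patched binary built with the same flags, so differences between them are less likely to come from the compiler.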
