🤖 AI Summary
Existing binary corpora lack unified modeling of cross-compiler builds, multi-version evolution, and vulnerability labels. This work introduces a queryable binary dataset encompassing 248 open-source projects, multiple compilers and optimization levels, and historical versions spanning several years, integrating build diversity, temporal dynamics, and CVE annotations into a cohesive framework for the first time. The dataset links binaries to their corresponding source code, functions, debug information, and version metadata via database indexing, enabling multidimensional analysis through LLM benchmarks, embedding models (jTrans, MalConv), and TLSH hashing. The experiments reveal that large language models rely on build artifacts rather than semantic reasoning, quantify how versions of the same package cluster in embedding space, and use Bayesian regression to separate the sources of binary similarity, demonstrating the dataset’s value for fine-grained, traceable binary analysis.
📝 Abstract
Existing binary corpora typically capture only one or two axes of binary variation: they either provide cross-compiler builds without a temporal axis, or CVE labels for single-build binaries. None combine cross-build diversity, cross-version history, and CVE labels into a queryable structure. We present ASSEMBLAGE-DEEPHISTORY, which consolidates these dimensions into a unified framework where every binary's compilation context, source code, vulnerable functions, and package version are stored as first-class metadata.
ASSEMBLAGE-DEEPHISTORY comprises 73,610 binaries spanning 248 open-source projects, compiled across GCC, Clang, and MSVC at multiple optimization levels on Linux and Windows, with multi-year historical builds. Each binary is indexed in a database that links it to its source code, functions, debug info, variant builds, historical versions, and vulnerable functions. Three analyses demonstrate this structure's value: (1) a three-stage LLM benchmark (recognition, strategy-guided detection, and cross-build transfer) to test whether LLMs reason about binary vulnerabilities or pattern-match on build-specific artifacts; (2) a comparison of MalConv embeddings, jTrans function embeddings, and TLSH fuzzy hashes quantifying how same-package versions cluster in each space; and (3) a Bayesian regression decomposing binary similarity into contributions from temporal distance, file changes, and commits.
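To make the idea of a queryable index concrete, the following is a minimal sketch using Python's built-in `sqlite3` module. It is not the dataset's actual schema: the table names (`binaries`, `functions`), all column names, the package name `libdemo`, and the CVE identifier `CVE-0000-0001` are hypothetical placeholders chosen for illustration. It shows only how linking each binary to its build context and labeled functions lets one query pull every build variant and version that contains a given vulnerable function.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the layout used by ASSEMBLAGE-DEEPHISTORY.
SCHEMA = """
CREATE TABLE binaries (
    id        INTEGER PRIMARY KEY,
    package   TEXT NOT NULL,
    version   TEXT NOT NULL,
    compiler  TEXT NOT NULL,
    opt_level TEXT NOT NULL,
    platform  TEXT NOT NULL
);
CREATE TABLE functions (
    id        INTEGER PRIMARY KEY,
    binary_id INTEGER NOT NULL REFERENCES binaries(id),
    name      TEXT NOT NULL,
    cve_id    TEXT  -- NULL when the function carries no CVE label
);
"""


def build_demo_db():
    """Create an in-memory database populated with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO binaries VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "libdemo", "1.0", "gcc", "O0", "linux"),
            (2, "libdemo", "1.0", "clang", "O2", "linux"),
            (3, "libdemo", "1.1", "gcc", "O0", "linux"),
        ],
    )
    conn.executemany(
        "INSERT INTO functions VALUES (?, ?, ?, ?)",
        [
            (1, 1, "parse_header", "CVE-0000-0001"),  # placeholder CVE id
            (2, 2, "parse_header", "CVE-0000-0001"),
            (3, 3, "parse_header", None),  # treated as fixed in version 1.1
            (4, 1, "main", None),
        ],
    )
    conn.commit()
    return conn


def builds_with_cve(conn, cve_id):
    """Return (package, version, compiler, opt_level, function) per match."""
    cursor = conn.execute(
        """
        SELECT b.package, b.version, b.compiler, b.opt_level, f.name
        FROM functions AS f
        JOIN binaries AS b ON b.id = f.binary_id
        WHERE f.cve_id = ?
        ORDER BY b.id
        """,
        (cve_id,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in builds_with_cve(demo, "CVE-0000-0001"):
        print(row)
```

Storing compiler, optimization level, and version as ordinary columns is what allows the same join to be filtered along any axis, for example holding the package version fixed while varying the compiler for a cross-build comparison.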