🤖 AI Summary
Existing static binary data-flow analyses lack systematic evaluation, suffer from low precision (0.13), and exhibit poorly understood bottlenecks. Method: We introduce the first large-scale, manually annotated benchmark—comprising over 215K micro-benchmark test cases—and conduct the first quantitative evaluation of three prominent frameworks (angr, Ghidra, and Miasm), revealing pervasive precision limitations. To address these, we propose three model extensions: (i) dynamic data-flow-guided constraint strengthening, (ii) cross-function contextual modeling, and (iii) joint precision–recall optimization. Contribution/Results: Our approach achieves a recall of 0.99 and a precision of 0.32, a 146% precision improvement over the baselines. Further validation on real-world CVE samples demonstrates its effectiveness in identifying vulnerable instructions in practical vulnerability-detection scenarios.
📝 Abstract
Data-flow analysis is a critical component of security research. Theoretically, accurate data-flow analysis of binary executables is undecidable due to the complexities of binary code. Practically, many binary analysis engines offer some data-flow analysis capability, but we lack an understanding of these analyses' accuracy and limitations. We address this problem by introducing a labeled benchmark data set of 215,072 micro-benchmark test cases, mapping to 277,072 binary executables, created specifically to evaluate data-flow analysis implementations. Additionally, we augment our benchmark set with dynamically discovered data flows from 6 real-world executables. Using our benchmark data set, we evaluate three state-of-the-art data-flow analysis implementations, in angr, Ghidra, and Miasm, and discuss their very low accuracy and the reasons behind it. We further propose three model extensions to static data-flow analysis that significantly improve accuracy, achieving almost perfect recall (0.99) and increasing precision from 0.13 to 0.32. Finally, we show that leveraging these model extensions in a vulnerability-discovery context leads to a tangible improvement in vulnerable-instruction identification.
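As a sanity check on the headline numbers, the relative precision improvement and the implied F1 score follow directly from the figures reported above (0.13 → 0.32 precision, 0.99 recall); the F1 value itself is derived here and is not stated in the paper:

```python
# Figures reported in the abstract.
precision_baseline = 0.13
precision_extended = 0.32
recall_extended = 0.99

# Relative precision improvement: (0.32 - 0.13) / 0.13 ≈ 146%.
improvement_pct = (precision_extended - precision_baseline) / precision_baseline * 100
print(f"precision improvement: {improvement_pct:.0f}%")  # → 146%

# F1 implied by the reported precision and recall (derived, not from the paper).
f1 = 2 * precision_extended * recall_extended / (precision_extended + recall_extended)
print(f"implied F1: {f1:.2f}")  # → 0.48
```

This confirms that the "146%" claim is the relative gain over the 0.13 baseline, not an absolute precision value.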