🤖 AI Summary
This work addresses vulnerability analysis when only binary patches are available, without source code or security advisories. It proposes Patch2Vuln, an end-to-end pipeline for semantic vulnerability reconstruction that combines binary diffing with a language-model agent. Ghidra and Ghidriff compare the unpatched and patched ELF binaries to identify and rank changed functions; a local, offline agent then produces a preliminary audit, a bounded validation plan, and a final audit with a root-cause attribution, without network access or source code. On 25 Ubuntu `.deb` package pairs (20 security updates and five negative controls), the agent localizes a verified security-relevant patched function in 10 of the 20 security pairs and assigns an accepted root-cause class in 11. A bounded validation pass yields two minimized old/new behavioral differentials, both for tcpdump, and all five negative controls are classified as unknown.
📝 Abstract
Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit.
We evaluate Patch2Vuln on 25 Ubuntu `.deb` package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.
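The abstract does not spell out how changed functions are ranked, so the following is only a minimal sketch of what such a step might look like. It assumes a differ has already produced one record per changed function. The `ChangedFunction` fields, the `RISKY_CALLEES` set, the weights, and the example function names are all hypothetical illustrations, not the paper's feature set or Ghidriff's output schema.

```python
from dataclasses import dataclass

# Hypothetical hint list: callees that often appear near memory-safety fixes.
RISKY_CALLEES = {"memcpy", "strcpy", "strcat", "sprintf", "malloc", "realloc", "free"}


@dataclass
class ChangedFunction:
    """One changed function, as an assumed summary of a binary diff record."""
    name: str
    similarity: float            # 0.0 (rewritten) .. 1.0 (identical)
    old_callees: frozenset       # callees in the unpatched binary
    new_callees: frozenset       # callees in the patched binary
    added_branches: int = 0      # extra conditional branches after the patch


def score(fn: ChangedFunction) -> float:
    """Heuristic priority: larger means more likely security-relevant."""
    changed_calls = fn.old_callees ^ fn.new_callees          # calls added or removed
    risky = (fn.old_callees | fn.new_callees) & RISKY_CALLEES
    return (
        (1.0 - fn.similarity)        # how much the body changed
        + 0.5 * fn.added_branches    # new checks often mean new validation
        + 0.25 * len(changed_calls)
        + 0.5 * len(risky)
    )


def rank(functions, top_k=5):
    """Return the top_k changed functions, highest score first."""
    return sorted(functions, key=score, reverse=True)[:top_k]


# Made-up candidates for illustration only.
candidates = [
    ChangedFunction("usage", 0.95, frozenset({"printf"}), frozenset({"printf"})),
    ChangedFunction("print_packet", 0.80, frozenset({"memcpy"}),
                    frozenset({"memcpy"}), added_branches=2),
    ChangedFunction("init_tables", 0.60, frozenset(), frozenset()),
]
ranked_names = [fn.name for fn in rank(candidates)]
```

A ranker of this kind is also where the coverage failures reported above can arise: if the relevant function is missing from the differ's output or falls below `top_k`, the agent never sees it, regardless of how well it reasons.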