Binary Diff Summarization using Large Language Models

📅 2025-09-28
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To help analysts quickly identify malicious changes in binary updates within software supply chains, this paper proposes a large language model (LLM)-based method that summarizes binary differences in natural language. The approach combines reverse engineering, binary diffing, and LLM-driven summarization of the changes. Key contributions include: (1) a software supply chain security benchmark for binary diffing, built by injecting malware into open-source projects; (2) the functional sensitivity score (FSS), which supports automated triage of sensitive binary functions; and (3) an end-to-end framework that connects low-level binary analysis to high-level natural language summaries. On the custom benchmark, the method achieves 0.98 precision and 0.64 recall for malware detection, and FSS values for malicious and benign functions are separated by 3.0 points. In a case study on the real-world XZ Utils supply chain attack, the framework correctly flags the injected backdoor functions with high FSS.

📝 Abstract
Security of software supply chains is necessary to ensure that software updates do not contain maliciously injected code or introduce vulnerabilities that may compromise the integrity of critical infrastructure. Verifying the integrity of software updates involves binary differential analysis (binary diffing) to highlight the changes between two binary versions by incorporating binary analysis and reverse engineering. Large language models (LLMs) have been applied to binary analysis to augment traditional tools by producing natural language summaries that cybersecurity experts can grasp for further analysis. Combining LLM-based binary code summarization with binary diffing can improve the LLM's focus on critical changes and enable complex tasks such as automated malware detection. To address this, we propose a novel framework for binary diff summarization using LLMs. We introduce a novel functional sensitivity score (FSS) that helps with automated triage of sensitive binary functions for downstream detection tasks. We create a software supply chain security benchmark by injecting 3 different malware into 6 open-source projects, which yields 104 binary versions, 392 binary diffs, and 46,023 functions. On this benchmark, our framework achieves a precision of 0.98 and recall of 0.64 for malware detection, displaying high accuracy with low false positives. Across malicious and benign functions, we achieve FSS separation of 3.0 points, confirming that FSS categorization can classify sensitive functions. We conduct a case study on the real-world XZ Utils supply chain attack; our framework correctly detects the injected backdoor functions with high FSS.
Problem

Research questions and friction points this paper is trying to address.

Summarizing binary code changes using LLMs for security analysis
Automating malware detection in software updates via differential analysis
Identifying sensitive functions in binaries with functional sensitivity scoring
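
As a rough illustration of what function-level differential analysis produces, the sketch below compares two versions of a binary that have already been split into named functions. It is not the paper's pipeline: the dictionaries, the function names, the byte strings, and the helpers `fingerprint` and `diff_functions` are all hypothetical, and real binary diffing tools match functions by structure and similarity rather than by exact name and hash.

```python
import hashlib


def fingerprint(body: bytes) -> str:
    """Return a stable hash of a function's raw bytes."""
    return hashlib.sha256(body).hexdigest()


def diff_functions(old: dict[str, bytes], new: dict[str, bytes]) -> dict[str, list[str]]:
    """Classify functions as added, removed, or modified between two versions.

    Simplifying assumption: functions are matched by name only.
    """
    old_fp = {name: fingerprint(body) for name, body in old.items()}
    new_fp = {name: fingerprint(body) for name, body in new.items()}
    shared = old_fp.keys() & new_fp.keys()
    return {
        "added": sorted(new_fp.keys() - old_fp.keys()),
        "removed": sorted(old_fp.keys() - new_fp.keys()),
        "modified": sorted(n for n in shared if old_fp[n] != new_fp[n]),
    }


# Toy inputs (made up for illustration, not taken from the benchmark).
old_version = {
    "parse_header": b"\x55\x48\x89\xe5",
    "checksum": b"\x31\xc0\xc3",
}
new_version = {
    "parse_header": b"\x55\x48\x89\xe5",
    "checksum": b"\x31\xc0\x90\xc3",
    "open_socket": b"\x0f\x05",
}

changes = diff_functions(old_version, new_version)
print(changes)
```

Only the added and modified functions would then need to be summarized and scored, which is how diffing narrows the attention of a downstream model to the parts of an update that actually changed.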
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes LLM-based binary diff summarization framework
Introduces functional sensitivity score for triage
Detects malware with high precision (0.98) and moderate recall (0.64)
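
To make the reported metrics concrete, the sketch below computes precision and recall from binary labels, plus a score gap between two groups of functions. The labels and scores are toy values, not data from the paper, and treating "FSS separation" as the difference between the mean malicious score and the mean benign score is an assumption, since the summary does not define it.

```python
from statistics import mean


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def separation(malicious_scores, benign_scores):
    """Assumed definition: difference between the two group means."""
    return mean(malicious_scores) - mean(benign_scores)


# Toy example: 5 malicious and 5 benign diffs, 3 of the malicious ones flagged.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)

# Toy sensitivity scores on an arbitrary scale.
gap = separation([8, 6, 7], [2, 3, 4])

print(f"precision={precision:.2f} recall={recall:.2f} gap={gap}")
```

In this toy case every flagged diff is truly malicious, so precision is perfect, while two malicious diffs go unflagged and pull recall down. That is the same shape as the reported result: few false alarms, but some malicious updates missed.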