LLM-Enhanced Static Analysis for Precise Identification of Vulnerable OSS Versions

📅 2024-08-14

🏛️ arXiv.org

📈 Citations: 9

✨ Influential: 0

🤖 AI Summary

Problem: Existing approaches for identifying vulnerable versions of open-source C/C++ software suffer from low precision because they include vulnerability-irrelevant code in their analysis and rely on syntactic-level code clone detection, which is inadequate for this task. Method: This paper proposes Vercation, an LLM-driven, semantics-aware framework. It combines program slicing with a large language model to extract vulnerability-relevant code from patches, then backtraces historical commits and applies semantic-level code clone detection to the code before and after each modification. This locates the Vulnerability-Introducing Commit (VIC) and identifies the affected versions between the VIC and the patch commit. Using the LLM to select vulnerability-relevant context addresses the limitations of rule-based tracing and purely syntactic matching. Results: Evaluated on a dataset of 74 vulnerabilities across 1,013 software versions, the framework achieves an F1 score of 92.4%, outperforming state-of-the-art methods; it also identifies 134 incorrectly labeled vulnerable versions in NVD reports.

📝 Abstract

Open-source software (OSS) has surged in popularity thanks to its collaborative development model and low cost. However, adopting a specific software version in a development project may introduce security risks when that version carries vulnerabilities. Current methods for identifying vulnerable versions typically analyze and trace the code involved in vulnerability patches using static analysis with pre-defined rules, then use syntactic-level code clone detection to identify the vulnerable versions. These methods are imprecise because (1) they include vulnerability-irrelevant code in the analysis and (2) syntactic-level code clone detection is inadequate. This paper presents Vercation, an approach designed to identify vulnerable versions of OSS written in C/C++. Vercation combines program slicing with a Large Language Model (LLM) to identify vulnerability-relevant code from vulnerability patches. It then backtraces historical commits to gather previous modifications of the identified vulnerability-relevant code. We propose semantic-level code clone detection to compare pre-modification and post-modification code, thereby locating the vulnerability-introducing commit (VIC) and enabling identification of the vulnerable versions between the VIC and the patch commit. We curate a dataset linking 74 OSS vulnerabilities and 1,013 versions to evaluate Vercation. On this dataset, our approach achieves an F1 score of 92.4%, outperforming current state-of-the-art methods. More importantly, Vercation detected 134 incorrect vulnerable OSS versions in NVD reports.
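The final step described in the abstract, mapping a located VIC and a patch commit to a set of vulnerable versions, can be illustrated with a minimal sketch. It assumes a linear, oldest-first commit history in which some commits carry a release tag. The `Commit` type, the `vulnerable_versions` function, and the sample data are hypothetical names introduced for illustration; this is not Vercation's implementation, and real repositories with branches and backports would need more care.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    sha: str
    release_tag: Optional[str] = None  # version released at this commit, if any


def vulnerable_versions(history: List[Commit], vic_sha: str, patch_sha: str) -> List[str]:
    """Return release tags cut at or after the VIC and before the patch.

    Assumption: `history` is a single linear branch ordered oldest-first.
    """
    shas = [c.sha for c in history]
    start = shas.index(vic_sha)    # first commit containing the flaw
    end = shas.index(patch_sha)    # first commit containing the fix
    if start > end:
        raise ValueError("the VIC must precede the patch commit")
    # A release tagged on the patch commit already includes the fix,
    # so the window is half-open: [VIC, patch).
    return [c.release_tag for c in history[start:end] if c.release_tag]


# Hypothetical history: the flaw enters at b2 and is fixed at e5.
demo_history = [
    Commit("a1", "1.0"),
    Commit("b2"),
    Commit("c3", "1.1"),
    Commit("d4", "1.2"),
    Commit("e5"),
    Commit("f6", "1.3"),
]
affected = vulnerable_versions(demo_history, "b2", "e5")
```

In this example, version 1.0 predates the VIC and version 1.3 follows the patch, so only the releases in between are reported. A misplaced VIC shifts the start of the window, which is one way incorrect version ranges can end up in vulnerability reports.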

Problem

Research questions and friction points this paper is trying to address.

Identifying vulnerable OSS versions using static analysis and an LLM

Improving precision in detecting vulnerability-relevant code changes

Correcting inaccurate vulnerable versions in existing security reports

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines program slicing with an LLM to extract vulnerability-relevant code

Backtraces historical commits to collect earlier modifications of that code

Uses expanded normalized ASTs for detection
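To show why normalization matters for clone detection, here is a deliberately simplified sketch. It works on tokens rather than on an AST, so it is a stand-in for the general idea and not the expanded normalized AST technique the paper uses. The `normalize` function, the keyword set, and the sample C snippets are assumptions made for illustration.

```python
import re

# A small subset of C keywords that must not be renamed (illustrative, not exhaustive).
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void",
              "unsigned", "long", "sizeof", "struct", "const", "static"}

# Identifiers, hex and decimal literals, string literals, then any other single character.
TOKEN_RE = re.compile(
    r'[A-Za-z_]\w*|0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?|"(?:\\.|[^"\\])*"|\S'
)


def normalize(code: str) -> list:
    """Replace variable names and literals with placeholders.

    Names directly followed by "(" are kept, on the assumption that they are
    function calls whose identity matters (e.g. memcpy).
    """
    tokens = TOKEN_RE.findall(code)
    names = {}
    out = []
    for i, tok in enumerate(tokens):
        if tok[0].isdigit():
            out.append("NUM")
        elif tok[0] == '"':
            out.append("STR")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            if i + 1 < len(tokens) and tokens[i + 1] == "(":
                out.append(tok)  # keep the callee name
            else:
                # Number variables in order of first appearance.
                out.append(names.setdefault(tok, f"VAR{len(names)}"))
        else:
            out.append(tok)
    return out


original = "if (len > buf_size) { memcpy(dst, src, len); }"
renamed = "if (n > cap) { memcpy(out, in_buf, n); }"       # same logic, new names
changed = "if (len >= buf_size) { memcpy(dst, src, len); }"  # different bound check

same_logic = normalize(original) == normalize(renamed)
different_logic = normalize(original) != normalize(changed)
```

A plain textual comparison treats `original` and `renamed` as different, while the normalized sequences are identical; the altered bound check in `changed` still shows up as a difference. Token-level normalization cannot see through statement reordering or equivalent rewrites, which is the kind of gap that AST-based and semantic-level comparison aim to close.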

🔎 Similar Papers

LLM-Assisted Static Analysis for Detecting Security Vulnerabilities