Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes

📅 2025-03-29
🤖 AI Summary
To address the scarcity of patch information in vulnerability databases (e.g., NVD and the GitHub Advisory Database), which causes delayed remediation, incorrect vulnerability assessment, and elevated exploitation risk, this paper proposes SITPatchTracer, a CVE-driven, repository-scale patch tracing system. It pioneers the integration of GritLM, an embedding LLM with unlimited context length, with a hierarchical embedding approach to model the semantic similarity between CVE descriptions and full-diff patches. A scalable retrieval framework is built on ElasticSearch and learning-to-rank. Evaluated on two datasets, the system achieves higher recall than PatchFinder, PatchScout, and VFCFinder, and outperforms the commercial SOTA Voyage code embedding API by 13% and 28%. Key contributions are: (i) the first long-context-aware semantic retrieval system for CVE-to-patch alignment; and (ii) the first open-source patch tracing framework that simultaneously supports repository-scale indexing and fine-grained semantic alignment.

📝 Abstract
In recent years, the rapid increase in security vulnerabilities has made them much harder to manage. One critical task in vulnerability management is tracing the patches that fix a vulnerability. By accurately tracing the patching commits, security stakeholders can precisely identify affected software components, determine vulnerable and fixed versions, and assess severity, which facilitates rapid deployment of mitigations. However, previous work has shown that patch information is often missing in vulnerability databases, including both the National Vulnerability Database (NVD) and the GitHub Advisory Database, which increases the risk of delayed mitigation, incorrect vulnerability assessment, and potential exploits. Although existing work has proposed several approaches for patch tracing, these approaches suffer from two major limitations: (1) a lack of scalability to the full-repository level, and (2) a lack of study on how to model the semantic similarity between the CVE and the full diff code. To address this gap, we propose SITPatchTracer, a scalable full-repo, full-context retrieval system for security vulnerability patch tracing. SITPatchTracer leverages ElasticSearch, learning-to-rank, and a hierarchical embedding approach based on GritLM, a top-ranked LLM for text embedding with unlimited context length and fast inference speed. The evaluation of SITPatchTracer shows that it achieves high recall on both evaluated datasets. In recall, SITPatchTracer not only outperforms several existing works (PatchFinder, PatchScout, VFCFinder), but also outperforms Voyage, the SOTA commercial code embedding API, by 13% and 28%.
Problem

Research questions and friction points this paper is trying to address.

Tracing security vulnerability patches in large code repositories
Improving semantic similarity modeling between CVEs and code diffs
Enhancing scalability and efficiency of patch retrieval systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses ElasticSearch for scalable full-repo retrieval
Applies learning-to-rank for improved patch tracing
Leverages GritLM for a hierarchical embedding approach over full diffs
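
To make the combination of components listed above more concrete, here is a minimal sketch of a two-stage "retrieve, then re-rank" pipeline for matching a CVE description to candidate commits. It is an illustration under stated assumptions, not the SITPatchTracer implementation: token overlap stands in for ElasticSearch retrieval, a bag-of-words vector stands in for GritLM embeddings, a single cosine score stands in for a learned ranker, and splitting a diff by file and taking the best-matching chunk is only one possible reading of "hierarchical embedding". All function names and the sample commits are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric/underscore tokens, so parse_header stays one token.
    return re.findall(r"[a-z0-9_]+", text.lower())


def embed(text):
    # ASSUMPTION: toy stand-in for a neural embedding model (sparse word counts).
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_diff_by_file(diff):
    # Start a new chunk at every "diff --git" header, one chunk per changed file.
    chunks, current = [], []
    for line in diff.splitlines():
        if line.startswith("diff --git ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def lexical_candidates(cve_text, commits, k):
    # ASSUMPTION: token overlap as a stand-in for a search-engine first stage.
    query = set(tokenize(cve_text))

    def overlap(commit):
        return len(query & set(tokenize(commit["message"] + "\n" + commit["diff"])))

    return sorted(commits, key=overlap, reverse=True)[:k]


def trace_patch(cve_text, commits, k=10):
    # Stage 1: cheap lexical filter. Stage 2: score each candidate by its
    # best-matching chunk (commit message or one per-file diff chunk).
    query_vec = embed(cve_text)
    scored = []
    for commit in lexical_candidates(cve_text, commits, k):
        chunks = [commit["message"]] + split_diff_by_file(commit["diff"])
        score = max(cosine(query_vec, embed(chunk)) for chunk in chunks)
        scored.append((commit["id"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    cve = "Buffer overflow in parse_header allows remote attackers to crash the server"
    commits = [
        {"id": "b2", "message": "Update README",
         "diff": "diff --git a/README.md b/README.md\n+ installation notes"},
        {"id": "a1", "message": "Fix buffer overflow in parse_header",
         "diff": "diff --git a/src/http.c b/src/http.c\n+ if (len > MAX_HEADER) return -1;"},
    ]
    for commit_id, score in trace_patch(cve, commits, k=2):
        print(commit_id, round(score, 3))
```

Scoring by the best chunk rather than the whole diff keeps a large commit from diluting the one file that actually addresses the vulnerability; a real system would replace each stand-in with the corresponding component.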
Xueqing Liu
Stevens Institute of Technology, USA
Jiangrui Zheng
Stevens Institute of Technology, USA
Guanqun Yang
Ph.D. Candidate in Computer Science, Stevens Institute of Technology
Machine Learning, Natural Language Processing
Siyan Wen
Stevens Institute of Technology, USA
Qiushi Liu
ZJU-UIUC Institute, China