Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes

📅 2025-03-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vulnerability databases such as the NVD and the GitHub Advisory Database often lack patch information, which leads to delayed remediation, inaccurate impact assessment, and elevated exploitation risk. To address this, the paper proposes SITPatchTracer, a CVE-driven, repository-scale patch tracing system. It combines ElasticSearch, learning-to-rank, and a hierarchical embedding approach based on GritLM, an embedding LLM the authors describe as having unlimited context length, to match CVE descriptions against full-diff patches. On both evaluated datasets, the system achieves higher recall than PatchFinder, PatchScout, and VFCFinder, and it outperforms Voyage, a commercial SOTA code embedding API, by 13% and 28%. The stated contributions are (i) scaling patch tracing to the full-repository level and (ii) modeling the semantic similarity between a CVE description and the full diff of a candidate commit.

📝 Abstract

In recent years, the rapid increase in security vulnerabilities has created major challenges in managing them. One critical task in vulnerability management is tracing the patches that fix a vulnerability. By accurately tracing the patching commits, security stakeholders can precisely identify affected software components, determine vulnerable and fixed versions, and assess severity, which facilitates rapid deployment of mitigations. However, previous work has shown that patch information is often missing in vulnerability databases, including both the National Vulnerability Database (NVD) and the GitHub Advisory Database, which increases the risk of delayed mitigation, incorrect vulnerability assessment, and potential exploits. Although existing work has proposed several approaches for patch tracing, they suffer from two major limitations: (1) a lack of scalability to the full-repository level, and (2) a lack of study on how to model the semantic similarity between the CVE and the full diff code. To address this gap, we propose SITPatchTracer, a scalable full-repo full-context retrieval system for security vulnerability patch tracing. SITPatchTracer leverages ElasticSearch, learning-to-rank, and a hierarchical embedding approach based on GritLM, a top-ranked LLM for text embedding with unlimited context length and fast inference speed. The evaluation shows that SITPatchTracer achieves high recall on both evaluated datasets. Its recall outperforms not only several existing works (PatchFinder, PatchScout, VFCFinder) but also Voyage, the SOTA commercial code embedding API, by 13% and 28%.
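Because the abstract reports results in terms of recall, a small illustration of how recall is often computed for patch tracing may help. The sketch below assumes one common definition (a CVE counts as a hit if any of its true patch commits appears in the top k ranked commits); the exact metric used in the paper is not stated in this summary. The function name `recall_at_k`, the CVE identifiers, and the commit hashes are all hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                truth: Dict[str, Set[str]],
                k: int) -> float:
    """Fraction of CVEs with at least one true patch commit in the top-k results.

    ranked: CVE id -> commit ids ordered from most to least relevant.
    truth:  CVE id -> set of commit ids known to fix that CVE.
    """
    if not truth:
        return 0.0
    hits = sum(
        1
        for cve, patches in truth.items()
        if patches & set(ranked.get(cve, [])[:k])
    )
    return hits / len(truth)


# Made-up toy data for illustration only.
ranked = {
    "CVE-A": ["c3", "c1", "c7"],  # true patch c1 is ranked second
    "CVE-B": ["c9", "c4", "c2"],  # true patch c2 is ranked third
}
truth = {"CVE-A": {"c1"}, "CVE-B": {"c2"}}

for k in (1, 2, 3):
    print(k, recall_at_k(ranked, truth, k))
```

Under this definition, raising k can only keep recall the same or increase it, which is why retrieval systems are usually compared at a fixed k.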

Problem

Research questions and friction points this paper is trying to address.

Tracing security vulnerability patches in large code repositories

Improving semantic similarity modeling between CVEs and code diffs

Enhancing scalability and efficiency of patch retrieval systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses ElasticSearch for scalable full-repo retrieval

Applies learning-to-rank for improved patch tracing

Leverages GritLM for a hierarchical embedding approach
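The idea of scoring a long diff hierarchically can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not say how SITPatchTracer chunks diffs or aggregates chunk scores, so splitting per file and taking the maximum chunk similarity are assumptions. `toy_embed` is a bag-of-words stand-in for a real embedding model such as GritLM, and the ElasticSearch and learning-to-rank stages are omitted entirely. All function names and sample data are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def split_diff_by_file(diff: str) -> List[str]:
    """Split a unified diff into per-file chunks at 'diff --git' headers."""
    chunks = re.split(r"(?m)^(?=diff --git )", diff)
    return [c for c in chunks if c.strip()]


def score_commit(cve_description: str, diff: str) -> float:
    """Assumed aggregation: best-matching chunk decides the commit score."""
    query = toy_embed(cve_description)
    return max(
        (cosine(query, toy_embed(chunk)) for chunk in split_diff_by_file(diff)),
        default=0.0,
    )


def rank_commits(cve_description: str, diffs: Dict[str, str]) -> List[str]:
    """Order commit ids from most to least similar to the CVE description."""
    return sorted(diffs, key=lambda c: score_commit(cve_description, diffs[c]),
                  reverse=True)


# Made-up example data.
cve = "heap overflow in parse_header when length is unchecked"
fix_diff = (
    "diff --git a/docs/README b/docs/README\n"
    "+update changelog\n"
    "diff --git a/src/parser.c b/src/parser.c\n"
    "+if (length > MAX) return -1; /* avoid overflow in parse_header */\n"
)
other_diff = (
    "diff --git a/ui/button.js b/ui/button.js\n"
    "+color = blue\n"
)

print(rank_commits(cve, {"other": other_diff, "fix": fix_diff}))
```

Scoring per chunk keeps an unrelated file in the same commit (here the README change) from diluting the signal of the file that actually contains the fix, which is one motivation for a hierarchical design over embedding the whole diff at once.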
