🤖 AI Summary
Precise vulnerable function (VF) localization in open-source software remains challenging for three reasons: existing vulnerability databases lack VF annotations, traditional methods suffer from high noise and severe semantic gaps when patches are unavailable, and over 26% of VFs lie outside the functions modified by patches. Method: We propose VFArchē, a dual-mode framework that unifies VF localization for both patched and unpatched scenarios. With patches, it jointly leverages call-chain reachability analysis and code-change mining; without patches, it combines vulnerability description semantics with cross-modal source-code similarity matching for fine-grained, source-level VF identification. Contribution/Results: VFArchē is the first to synergistically model call-graph analysis and multi-granularity semantic alignment, overcoming patch dependency and lexical mismatch bottlenecks. Experiments show that it achieves 1.3× (patch-present) and 1.9× (patch-absent) the mean reciprocal rank (MRR) of the best baselines, accurately localizes VFs for 43 out of 50 newly disclosed vulnerabilities, and reduces SCA false positives by 78%–89%.
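For readers unfamiliar with the metric, the following is a minimal sketch of how mean reciprocal rank is commonly computed for a ranked list of candidate functions. It is not taken from the paper: the function names `reciprocal_rank` and `mean_reciprocal_rank` and the sample data are hypothetical, and the sketch assumes each vulnerability has a set of ground-truth VFs and that a ranking with no hit contributes 0.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant candidate (1-indexed), or 0.0 if none is ranked."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    ground_truth: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all vulnerabilities (hypothetical helper)."""
    if len(rankings) != len(ground_truth):
        raise ValueError("rankings and ground_truth must have the same length")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, ground_truth))
    return total / len(rankings)


# Illustrative data only: three vulnerabilities with ranked candidate functions.
example_rankings = [
    ["parse", "decode", "read"],  # true VF ranked 2nd -> 1/2
    ["open", "close"],            # true VF ranked 1st -> 1
    ["init", "flush"],            # true VF not ranked -> 0
]
example_truth = [{"decode"}, {"open"}, {"shutdown"}]
example_mrr = mean_reciprocal_rank(example_rankings, example_truth)
```

Under this definition, a reported 1.9× MRR means the average reciprocal rank of the first correct VF is 1.9 times that of the baseline, so correct functions tend to appear nearer the top of the candidate list.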
📝 Abstract
Software Composition Analysis (SCA) has become pivotal in addressing vulnerabilities inherent in software project dependencies. In particular, reachability analysis is increasingly used in Open-Source Software (OSS) projects to identify reachable vulnerabilities (e.g., CVEs) through call graphs, enabling a focus on exploitable risks. Performing reachability analysis typically requires the vulnerable function (VF) in order to trace call chains from downstream applications. However, this crucial information is usually unavailable in modern vulnerability databases such as NVD. While directly extracting VFs from the functions modified in vulnerability patches is intuitive, patches are not always available. Moreover, our preliminary study shows that over 26% of VFs are not among the modified functions. Meanwhile, searching for vulnerable functions without patches suffers from overwhelming noise and lexical gaps between vulnerability descriptions and source code. Given that almost half of the vulnerabilities are equipped with patches, a holistic solution that handles scenarios both with and without patches is required. To meet real-world needs and automatically localize VFs, we present VFArchē, a dual-mode approach designed for disclosed vulnerabilities and applicable whether or not patch links are available. Experimental results on our constructed benchmark dataset demonstrate strong efficacy on three metrics: VFArchē achieves 1.3x and 1.9x the Mean Reciprocal Rank of the best baselines in Patch-present and Patch-absent modes, respectively. Moreover, VFArchē has proven its applicability in real-world scenarios by successfully locating VFs for 43 of the 50 latest vulnerabilities with reasonable effort and by reducing the false positives of SCA tools by 78-89%.
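To illustrate why reachability analysis needs the VF as a target, the following is a minimal sketch of a call-chain search over a call graph. It is not VFArchē's implementation: the adjacency-list representation, the function name `find_call_chain`, and the sample graph are all assumptions made for illustration, and real call graphs would be produced by a static analysis tool.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional


def find_call_chain(
    call_graph: Dict[str, List[str]],
    entry_points: Iterable[str],
    vulnerable_function: str,
) -> Optional[List[str]]:
    """Breadth-first search for a shortest call chain from any entry point to the VF.

    Returns the chain as a list of function names, or None if the VF is unreachable.
    """
    parent: Dict[str, Optional[str]] = {}
    queue: deque = deque()
    for entry in entry_points:
        if entry not in parent:
            parent[entry] = None
            queue.append(entry)

    while queue:
        current = queue.popleft()
        if current == vulnerable_function:
            # Walk parent links back to the entry point to rebuild the chain.
            chain = []
            node: Optional[str] = current
            while node is not None:
                chain.append(node)
                node = parent[node]
            return chain[::-1]
        for callee in call_graph.get(current, ()):
            if callee not in parent:
                parent[callee] = current
                queue.append(callee)
    return None


# Hypothetical call graph: an application calling into a dependency "lib".
sample_graph = {
    "app.main": ["app.load", "lib.safe_parse"],
    "app.load": ["lib.parse"],
    "lib.parse": ["lib.decode"],
    "lib.other": ["lib.unused_vuln"],
}

reachable_chain = find_call_chain(sample_graph, ["app.main"], "lib.decode")
unreachable_chain = find_call_chain(sample_graph, ["app.main"], "lib.unused_vuln")
```

In this sketch, a vulnerability whose VF is `lib.decode` is reachable from `app.main`, while one whose VF is `lib.unused_vuln` is not, so an SCA tool could deprioritize the latter alert. This is the kind of filtering that depends on knowing the VF in the first place.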