🤖 AI Summary
This study addresses the lack of empirical understanding of how effective vulnerability-affected version identification tools are in practice. We conduct the first systematic, large-scale empirical evaluation: we construct a high-quality benchmark of 1,128 real-world C/C++ vulnerabilities and rigorously evaluate 12 state-of-the-art tools against it. The results show that the highest accuracy achieved by any individual tool is only 44.9%; ensemble strategies yield at most a 10.1-percentage-point improvement, and overall performance remains below 60%. The primary bottlenecks are overreliance on heuristic rules and insufficient semantic reasoning. We identify critical issues, including patch-matching bias, the root-cause distribution of false positives, and fundamental limitations of both paradigms, and propose concrete, actionable improvements. To foster reproducible research and advance the field, we publicly release both the benchmark dataset and the evaluation framework as a foundational resource for next-generation vulnerability impact analysis.
📝 Abstract
Identifying which software versions are affected by a vulnerability is critical for patching and risk mitigation. Despite a growing body of tools, their real-world effectiveness remains unclear due to narrow evaluation scopes, often limited to early SZZ variants, outdated techniques, and small or coarse-grained datasets. In this paper, we present the first comprehensive empirical study of vulnerability-affected version identification. We curate a high-quality benchmark of 1,128 real-world C/C++ vulnerabilities and systematically evaluate 12 representative tools from both the tracing and matching paradigms across four dimensions: effectiveness at both the vulnerability and version levels, root causes of false positives and negatives, sensitivity to patch characteristics, and ensemble potential. Our findings reveal fundamental limitations: no tool exceeds 45.0% accuracy, with key challenges stemming from heuristic dependence, limited semantic reasoning, and rigid matching logic. Patch structures such as add-only and cross-file changes further hinder performance. Although ensemble strategies can improve results by up to 10.1 percentage points, overall accuracy remains below 60.0%, highlighting the need for fundamentally new approaches. Moreover, our study offers actionable insights to guide tool development, combination strategies, and future research in this critical area. Finally, we release the replication code and benchmark on our website to encourage future contributions.
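As a rough illustration of the ensemble idea evaluated in the paper, here is a minimal sketch of majority voting over per-version verdicts from several tools. The tool names, data shapes, and voting rule below are hypothetical assumptions for illustration, not the paper's actual combination strategy:

```python
from collections import Counter

def majority_vote(verdicts: dict[str, bool]) -> bool:
    """Flag a version as affected if at least half of the tools say so.

    `verdicts` maps a tool name to its boolean verdict for one
    (vulnerability, version) pair; ties are resolved toward "affected".
    """
    counts = Counter(verdicts.values())
    return counts[True] >= counts[False]

# Hypothetical per-version verdicts from three tools for one CVE/version pair.
verdicts = {"tool_a": True, "tool_b": False, "tool_c": True}
print(majority_vote(verdicts))  # True: this version is flagged as affected
```

Even a simple rule like this can lift accuracy when individual tools make uncorrelated errors, which is consistent with the bounded (up to 10.1-percentage-point) gains the study reports for ensembles.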