🤖 AI Summary
Existing software vulnerability detection (SVD) methods suffer from insufficient context-aware robustness in complex software systems, high noise in evaluation data, and poor zero-shot generalization. To address these challenges, this paper proposes VulnSage: (1) the first high-fidelity, low-noise, multi-granularity (function-/file-/cross-component) C/C++ vulnerability benchmark dataset; (2) novel zero-shot structured reasoning prompting paradigms (e.g., Think&Verify), which reduce ambiguous response rates from 20.3% to 9.1% while improving detection accuracy; (3) a heuristic pre-filtering and LLM-coordinated data purification mechanism; and (4) the first systematic empirical analysis revealing how code-specialized models' performance diverges across vulnerability types, while demonstrating their consistent superiority over general-purpose LLMs. Evaluated on real-world open-source systems, VulnSage achieves effective cross-component vulnerability identification.
📝 Abstract
Automating software vulnerability detection (SVD) remains a critical challenge in an era of increasingly complex and interdependent software systems. Despite significant advances in Large Language Models (LLMs) for code analysis, prevailing evaluation methodologies often lack the **context-aware robustness** necessary to capture real-world intricacies and cross-component interactions. To address these limitations, we present **VulnSage**, a comprehensive evaluation framework and a dataset curated from diverse, large-scale open-source system software projects developed in C/C++. Unlike prior datasets, it leverages a heuristic noise pre-filtering approach combined with LLM-based reasoning to ensure a representative and minimally noisy spectrum of vulnerabilities. The framework supports multi-granular analysis across function, file, and inter-function levels and employs four diverse zero-shot prompt strategies: Baseline, Chain-of-Thought, Think, and Think&Verify. Through this evaluation, we uncover that structured reasoning prompts substantially improve LLM performance, with Think&Verify reducing ambiguous responses from 20.3% to 9.1% while increasing accuracy. We further demonstrate that code-specialized models consistently outperform general-purpose alternatives, with performance varying significantly across vulnerability types, revealing that no single approach universally excels across all security contexts. Link to dataset and code: https://github.com/Erroristotle/VulnSage.git
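To make the four zero-shot strategies concrete, the sketch below shows one hypothetical way such prompt templates could be organized. The exact wording used by VulnSage lives in the linked repository; these template strings, the `PROMPTS` dictionary, and `build_prompt` are illustrative placeholders, not the paper's actual prompts.

```python
# Hypothetical sketch of the four zero-shot prompt styles (Baseline,
# Chain-of-Thought, Think, Think&Verify). Template wording is invented
# for illustration and does not reproduce VulnSage's real prompts.

PROMPTS = {
    "baseline": (
        "Is the following C/C++ code vulnerable? Answer YES or NO.\n\n{code}"
    ),
    "chain_of_thought": (
        "Analyze the following C/C++ code step by step, then conclude "
        "whether it is vulnerable. Answer YES or NO.\n\n{code}"
    ),
    "think": (
        "First reason privately about data flow, memory safety, and input "
        "validation in the code below, then answer YES or NO.\n\n{code}"
    ),
    "think_verify": (
        "First reason about potential vulnerabilities in the code below. "
        "Then re-examine that reasoning, confirm or revise the conclusion, "
        "and answer YES or NO.\n\n{code}"
    ),
}

def build_prompt(strategy: str, code: str) -> str:
    """Fill the chosen template with the code snippet under analysis."""
    return PROMPTS[strategy].format(code=code)
```

The Think&Verify style adds an explicit self-check pass on top of free-form reasoning, which is the mechanism the abstract credits with cutting ambiguous responses from 20.3% to 9.1%.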