🤖 AI Summary
This study systematically evaluates the effectiveness, reliability, and scalability of large language models (LLMs) for project-level vulnerability detection. By constructing a benchmark of 222 real-world vulnerabilities and empirically comparing five LLM-based approaches against two traditional static analysis tools across 24 active open-source projects, the work reveals critical limitations of current LLM detectors at scale: low recall, high false-positive rates, and exorbitant computational costs (reaching hundreds of millions of tokens and requiring days of runtime). The analysis further identifies fundamental failure modes, including shallow interprocedural reasoning and misidentification of sources and sinks. Although LLMs occasionally uncover unique vulnerabilities missed by conventional tools, their overall practical utility remains limited, highlighting key directions for future research.
📄 Abstract
In this paper, we present the first comprehensive empirical study of specialized LLM-based detectors and compare them with traditional static analyzers at the project scale. Specifically, our study evaluates five recent, representative LLM-based methods and two traditional tools using: 1) an in-house benchmark of 222 known real-world vulnerabilities (C/C++ and Java) to assess detection capability, and 2) 24 active open-source projects, where we manually inspected 385 warnings to assess their practical usability and the root causes of failures. Our evaluation yields three key findings. First, while LLM-based detectors exhibit low recall on the in-house benchmark, they still uncover more unique vulnerabilities than traditional tools. Second, in open-source projects, both LLM-based and traditional tools generate substantial numbers of warnings but suffer from very high false discovery rates, hindering practical use. Our manual analysis further reveals shallow interprocedural reasoning and misidentified source/sink pairs as the primary failure causes, with LLM-based tools exhibiting additional unique failure modes. Finally, LLM-based methods incur substantial computational costs: hundreds of thousands to hundreds of millions of tokens and multi-hour to multi-day runtimes. Overall, our findings underscore critical limitations in the robustness, reliability, and scalability of current LLM-based detectors. We conclude by summarizing a set of implications for future research toward more effective and practical project-scale vulnerability detection.