🤖 AI Summary
This study systematically evaluates the effectiveness, reliability, and scalability of large language models (LLMs) for project-level vulnerability detection. By constructing a benchmark of 222 real-world vulnerabilities and empirically comparing five LLM-based approaches against two traditional static analysis tools across 24 active open-source projects, the work reveals critical limitations of current LLM detectors at scale: low recall, high false-positive rates, and exorbitant computational costs (reaching hundreds of millions of tokens and requiring days of runtime). The analysis further identifies fundamental failure modes, including shallow interprocedural reasoning and misidentification of sources and sinks. Although LLMs occasionally uncover unique vulnerabilities missed by conventional tools, their overall practical utility remains limited, highlighting key directions for future research.
📄 Abstract
In this paper, we present the first comprehensive empirical study of specialized LLM-based detectors and compare them with traditional static analyzers at the project scale. Specifically, our study evaluates five recent, representative LLM-based methods and two traditional tools using: 1) an in-house benchmark of 222 known real-world vulnerabilities (C/C++ and Java) to assess detection capability, and 2) 24 active open-source projects, where we manually inspected 385 warnings to assess their practical usability and the root causes of failures. Our evaluation yields three key findings. First, while LLM-based detectors exhibit low recall on the in-house benchmark, they still uncover more unique vulnerabilities than traditional tools. Second, in open-source projects, both LLM-based and traditional tools generate substantial numbers of warnings but suffer from very high false discovery rates, hindering practical use. Our manual analysis further reveals shallow interprocedural reasoning and misidentified source/sink pairs as the primary failure causes, with LLM-based tools exhibiting additional unique failure modes. Finally, LLM-based methods incur substantial computational costs: hundreds of thousands to hundreds of millions of tokens and multi-hour to multi-day runtimes. Overall, our findings underscore critical limitations in the robustness, reliability, and scalability of current LLM-based detectors. We conclude by summarizing a set of implications for future research toward more effective and practical project-scale vulnerability detection.