🤖 AI Summary
This study investigates the intrinsic capability of large language models (LLMs) to independently identify and comprehend code security vulnerabilities—without external tooling or auxiliary support.
Method: We introduce the first decoupled evaluation framework, using ablation experiments to isolate confounding factors—knowledge retrieval, contextual augmentation, and prompt engineering—and thereby obtain a clean measurement of intrinsic vulnerability reasoning ability. Our standardized evaluation spans 3,528 test scenarios across four mainstream LLMs (GPT-3.5, GPT-4, Phi-3, and Llama 3) on a multi-language benchmark (Solidity, Java, C, C++) comprising 294 manually curated samples (147 vulnerable and 147 non-vulnerable cases).
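The scenario count above is consistent with a full cross-product of samples, models, and enhancement conditions (294 × 4 × 3 = 3,528). The sketch below illustrates one plausible way such an ablation grid could be enumerated; the condition names and sample identifiers are hypothetical, not the paper's exact configuration.

```python
from itertools import product

# Hypothetical reconstruction of the evaluation grid: 294 samples
# (147 vulnerable + 147 non-vulnerable) x 4 LLMs x 3 enhancement
# conditions. Condition names are illustrative assumptions.
SAMPLES = [f"sample_{i:03d}" for i in range(294)]
MODELS = ["GPT-3.5", "GPT-4", "Phi-3", "Llama 3"]
CONDITIONS = ["knowledge_retrieval", "context_supplement", "prompt_scheme"]

scenarios = [
    {"sample": s, "model": m, "condition": c}
    for s, m, c in product(SAMPLES, MODELS, CONDITIONS)
]

print(len(scenarios))  # 294 * 4 * 3 = 3528
```

Enumerating the grid explicitly like this makes each ablation cell independently addressable, which is what allows one factor (e.g., knowledge retrieval) to be toggled while holding the others fixed.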
Contribution/Results: We quantitatively characterize the differential impact of three common enhancement techniques on vulnerability detection performance—the first such analysis. Our framework also uncovered 14 zero-day vulnerabilities, validated across four bug-bounty platforms and rewarded with USD 3,576 in bounties. The core contribution is a controlled, reproducible paradigm for evaluating LLMs' fundamental vulnerability reasoning capacity.
📝 Abstract
Large language models (LLMs) have demonstrated significant potential in various tasks, including those requiring human-level intelligence, such as vulnerability detection. However, recent efforts to use LLMs for vulnerability detection remain preliminary, as they lack a deep understanding of whether a subject LLM's vulnerability reasoning capability stems from the model itself or from external aids such as knowledge retrieval and tooling support. In this paper, we aim to decouple LLMs' vulnerability reasoning from other capabilities, such as vulnerability knowledge adoption, context information retrieval, and advanced prompt schemes. We introduce LLM4Vuln, a unified evaluation framework that separates and assesses LLMs' vulnerability reasoning capabilities and examines improvements when combined with other enhancements. We conduct controlled experiments using 147 ground-truth vulnerabilities and 147 non-vulnerable cases in Solidity, Java, and C/C++, testing them in a total of 3,528 scenarios across four LLMs (GPT-3.5, GPT-4, Phi-3, and Llama 3). Our findings reveal the varying impacts of knowledge enhancement, context supplementation, and prompt schemes. We also identify 14 zero-day vulnerabilities in four pilot bug bounty programs, resulting in $3,576 in bounties.