🤖 AI Summary
Large language models (LLMs) are highly sensitive to prompt formulation when used for vulnerability detection, yet this sensitivity has not been systematically evaluated. This work presents PromptAudit, a controlled evaluation framework that treats prompt sensitivity as a first-class system property of vulnerability detection. Holding the dataset, decoding, and parsing fixed while varying only the prompting strategy, the study evaluates five prompting strategies (including chain-of-thought, few-shot, and self-consistency) across five open-weight LLMs on 1,000 CVEs (6,074 code samples spanning 16 programming languages). Standard chain-of-thought prompting achieves the strongest overall operational performance; few-shot prompting provides model-dependent gains that are largest for prompt-sensitive models; adaptive chain-of-thought frequently suppresses recall; and self-consistency induces excessive abstention, sharply reducing effective performance.
📝 Abstract
Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples spanning 16 programming languages), we evaluate accuracy, recall, abstention, coverage, and effective F1. We find that standard chain-of-thought prompting achieves the strongest overall operational performance, while few-shot prompting provides model-dependent benefits that are most pronounced for prompt-sensitive models. In contrast, adaptive chain-of-thought frequently suppresses recall and self-consistency induces excessive abstention, sharply reducing effective performance. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first-class system property that must be explicitly characterized in evaluation and deployment.
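The abstract names its metrics (accuracy, recall, abstention, coverage, effective F1) without defining them. The sketch below shows one plausible way to score a detector that is allowed to abstain; it is an illustration under stated assumptions, not PromptAudit's actual implementation. In particular, the `score` function, the `Metrics` type, the use of `None` to mark an abstention, and the choice to count abstentions on vulnerable samples as misses when computing recall and effective F1 are all assumptions made here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Metrics:
    coverage: float      # fraction of samples that received a verdict
    abstention: float    # fraction of samples with no verdict
    accuracy: float      # correct verdicts / answered samples
    recall: float        # true positives / all vulnerable samples
    effective_f1: float  # F1 where abstentions on vulnerable code count as misses


def score(labels: Sequence[bool], preds: Sequence[Optional[bool]]) -> Metrics:
    """Score predictions where None means the model abstained.

    Assumed definitions (not taken from the paper): precision is computed
    over answered samples only, while recall is computed over every
    vulnerable sample, so an abstention on vulnerable code lowers recall.
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    answered = [(y, p) for y, p in zip(labels, preds) if p is not None]
    coverage = len(answered) / len(labels)

    tp = sum(1 for y, p in answered if y and p)
    fp = sum(1 for y, p in answered if not y and p)
    correct = sum(1 for y, p in answered if y == p)
    positives = sum(1 for y in labels if y)

    accuracy = correct / len(answered) if answered else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / positives if positives else 0.0
    denom = precision + recall
    effective_f1 = 2 * precision * recall / denom if denom else 0.0

    return Metrics(coverage, 1.0 - coverage, accuracy, recall, effective_f1)


if __name__ == "__main__":
    # Hypothetical toy data: three vulnerable samples, two safe ones.
    labels = [True, True, False, False, True]
    preds = [True, None, False, True, None]
    print(score(labels, preds))
```

Under these assumed definitions, a strategy that abstains often can keep accuracy on answered samples high while its recall and effective F1 fall, which is the kind of gap between apparent and operational performance that the abstract attributes to self-consistency.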