PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) are highly sensitive to prompt formulation in vulnerability detection, yet this sensitivity has not been systematically evaluated. This work proposes PromptAudit, a controlled evaluation framework that treats prompt sensitivity as a first-class property of vulnerability detection. Holding the dataset, decoding, and parsing fixed while varying only the prompting strategy, the study evaluates five prompting strategies (including chain-of-thought, adaptive chain-of-thought, few-shot, and self-consistency) across five open-weight LLMs on 6,074 code samples drawn from 1,000 CVEs and spanning 16 programming languages. Results show that standard chain-of-thought prompting achieves the strongest overall performance; few-shot prompting yields model-dependent gains that are largest for prompt-sensitive models; adaptive chain-of-thought frequently suppresses recall; and self-consistency induces excessive abstention, sharply reducing effective performance.
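
The controlled design described above (fix everything except the prompting strategy) can be pictured as a run grid. The following is a minimal sketch, not the paper's implementation: the `FixedConfig` fields, the model placeholders, and the `zero_shot` strategy name are all assumptions made for illustration, since the summary names only four of the five strategies and gives no configuration details.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FixedConfig:
    """Settings held constant across every run (all values are hypothetical)."""
    dataset_id: str = "cve-code-samples"   # hypothetical dataset identifier
    max_new_tokens: int = 512              # assumed decoding budget
    seed: int = 0                          # assumed decoding seed
    parser: str = "strict-label-parser"    # hypothetical output parser name


# Four strategies are named in the summary; "zero_shot" is an assumed fifth.
STRATEGIES = ["zero_shot", "few_shot", "cot", "adaptive_cot", "self_consistency"]

# Placeholder names standing in for the five open-weight models.
MODELS = ["model_a", "model_b", "model_c", "model_d", "model_e"]


def build_runs(models, strategies, fixed):
    """Return one run per (model, strategy) pair, all sharing the same fixed config."""
    return [
        {"model": m, "strategy": s, "config": fixed}
        for m, s in product(models, strategies)
    ]


runs = build_runs(MODELS, STRATEGIES, FixedConfig())
```

Because every run shares one immutable `FixedConfig`, any difference in results within a single model's row of the grid can be attributed to the prompting strategy alone, which is the isolation property the framework is built around.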

📝 Abstract

Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples spanning 16 programming languages), we evaluate accuracy, recall, abstention, coverage, and effective F1. We find that standard chain-of-thought prompting achieves the strongest overall operational performance, while few-shot prompting provides model-dependent benefits that are most pronounced for prompt-sensitive models. In contrast, adaptive chain-of-thought frequently suppresses recall and self-consistency induces excessive abstention, sharply reducing effective performance. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first-class system property that must be explicitly characterized in evaluation and deployment.
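
To make the interaction between abstention and the reported metrics concrete, here is a small sketch of how coverage and an abstention-aware F1 might be computed. The abstract does not define these metrics, so the definitions below are assumptions: an abstention is represented as `None`, coverage is the fraction of samples with a parsed answer, and "effective" scores count an abstention on a vulnerable sample as a miss. The function name `audit_metrics` is hypothetical.

```python
from typing import Optional, Sequence


def audit_metrics(labels: Sequence[bool], preds: Sequence[Optional[bool]]) -> dict:
    """Compute abstention-aware metrics.

    labels: True if the sample is vulnerable.
    preds:  True/False for a parsed verdict, None for an abstention.
    Assumption: abstentions count against accuracy and recall.
    """
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and equal in length")

    n = len(labels)
    pairs = list(zip(labels, preds))
    answered = [(y, p) for y, p in pairs if p is not None]

    tp = sum(1 for y, p in answered if y and p)
    fp = sum(1 for y, p in answered if not y and p)
    # Misses include wrong "safe" verdicts and abstentions on vulnerable samples.
    fn = sum(1 for y, p in pairs if y and not p)
    correct = sum(1 for y, p in answered if y == p)

    coverage = len(answered) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "coverage": coverage,
        "abstention": 1.0 - coverage,
        "accuracy": correct / n,
        "recall": recall,
        "effective_f1": f1,
    }


example = audit_metrics(
    labels=[True, True, True, False, False],
    preds=[True, None, False, False, True],
)
```

Under these assumed definitions, a strategy that abstains often loses recall and effective F1 even if its answered cases are mostly right, which is consistent with the abstract's observation that excessive abstention sharply reduces effective performance.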

Problem

Research questions and friction points this paper is trying to address.

prompt sensitivity

vulnerability detection

large language models

prompting strategies

reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Sensitivity

Vulnerability Detection

Large Language Models