On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability of few-shot prompting for code vulnerability detection with large language models (LLMs). We propose a dual-criterion exemplar selection method that jointly models *error consistency* (the tendency of an LLM to commit consistent errors across similar inputs) and *k*-NN similarity in a code semantic embedding space, thereby enhancing in-context learning efficacy. Unlike random sampling, our approach identifies highly informative “error-aware” exemplars through error pattern analysis and retrieves them via semantic code embeddings. Extensive experiments across multiple state-of-the-art LLMs (e.g., CodeLlama, DeepSeek-Coder) and benchmark datasets (Devign, MultiVul) demonstrate that our method significantly improves vulnerability detection accuracy, yielding an average +5.2% F1-score gain. Moreover, the dual-criterion strategy consistently outperforms either criterion used in isolation, validating the effectiveness and generalizability of error-driven exemplar selection for code security tasks.

📝 Abstract
Large language models (LLMs) have demonstrated impressive capabilities for many coding tasks, including summarization, translation, completion, and code generation. However, detecting code vulnerabilities remains a challenging task for LLMs. An effective way to improve LLM performance is in-context learning (ICL): providing few-shot examples similar to the query, along with correct answers, can improve an LLM's ability to generate correct solutions. However, choosing the few-shot examples appropriately is crucial to improving model performance. In this paper, we explore two criteria for choosing few-shot examples for ICL in the code vulnerability detection task. The first criterion considers whether the LLM consistently makes a mistake on a sample, with the intuition that LLM performance on a sample is informative about its usefulness as a few-shot example. The other criterion considers the similarity of candidate examples to the program under query and chooses few-shot examples based on the $k$-nearest neighbors to the given sample. We evaluate these criteria individually as well as in various combinations, using open-source models on multiple datasets.
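The abstract's $k$-nearest-neighbor criterion can be sketched as follows. This is a minimal illustration assuming cosine similarity over precomputed code embeddings; the paper does not specify the embedding model or distance metric here, so both are assumptions.

```python
import numpy as np

def knn_examples(query_emb, pool_embs, k):
    """Return indices of the k pool examples most similar to the query.

    Similarity is cosine similarity over (assumed) code embeddings.
    """
    pool = np.asarray(pool_embs, dtype=float)
    q = np.asarray(query_emb, dtype=float)
    # Cosine similarity between the query and every candidate embedding.
    sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q) + 1e-12)
    # Indices of the k highest-similarity examples, most similar first.
    return np.argsort(-sims)[:k].tolist()
```

The selected indices would then be used to pull labeled (code, verdict) pairs into the few-shot prompt ahead of the program under query.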
Problem

Research questions and friction points this paper is trying to address.

Selecting optimal few-shot examples for LLM vulnerability detection
Evaluating mistake-based and similarity-based example selection criteria
Improving code vulnerability detection through strategic in-context learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Select examples where LLM consistently makes mistakes
Choose examples based on k-nearest neighbors similarity
Combine mistake-based and similarity-based selection criteria
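The three ideas above might be combined as in this hypothetical sketch. The `error_consistency` scoring rule (fraction of repeated runs the model gets a sample wrong) and the `alpha` mixing weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def error_consistency(predictions, label):
    """Fraction of repeated LLM runs that misclassified this example.

    An example the LLM consistently gets wrong scores near 1.0.
    """
    preds = np.asarray(predictions)
    return float(np.mean(preds != label))

def select_examples(query_emb, pool_embs, pool_scores, k, alpha=0.5):
    """Rank candidates by a weighted blend of cosine similarity to the
    query and a precomputed error-consistency score (alpha is a
    hypothetical mixing weight)."""
    pool = np.asarray(pool_embs, dtype=float)
    q = np.asarray(query_emb, dtype=float)
    sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q) + 1e-12)
    combined = alpha * sims + (1 - alpha) * np.asarray(pool_scores, dtype=float)
    return np.argsort(-combined)[:k].tolist()
```

With `alpha=1.0` this reduces to pure similarity-based selection and with `alpha=0.0` to pure mistake-based selection, so the blend subsumes both individual criteria.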
Md Abdul Hannan
Colorado State University
Ronghao Ni
Carnegie Mellon University
Chi Zhang
Carnegie Mellon University
Limin Jia
Carnegie Mellon University
Programming Languages · Formal Methods · Security
Ravi Mangal
Assistant Professor, Colorado State University
Trustworthy AI · Formal Methods · Machine Learning · Safe Autonomy · Program Verification
Corina S. Pasareanu
Carnegie Mellon University