Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vulnerability detection benchmarks focus on single-vulnerability samples or function-level classification, failing to capture the complexity of real-world software where multiple vulnerabilities coexist and interact; moreover, large language models (LLMs) exhibit unquantified "count bias" and "selection bias" in multi-label detection. Method: We introduce the first cross-language (C, C++, Python, JavaScript), long-context (7.5k–10k token) benchmark for multi-vulnerability detection, built on a controllable vulnerability-density injection methodology and a multi-label evaluation framework. Using code sourced from CodeParrot, we construct a 40,000-file dataset and systematically evaluate five state-of-the-art models with F1-score, recall, and precision. Results: Llama-3.3-70B achieves ~0.97 F1 on single-vulnerability C tasks but degrades by up to 40% under high-density (9-vulnerability) settings; Python and JavaScript recall falls below 0.30, exposing pronounced language-specific failure modes. This is the first work to quantify and empirically validate these LLM biases in multi-vulnerability code-security scenarios.

📝 Abstract
Large Language Models (LLMs) have demonstrated significant potential in automated software security, particularly in vulnerability detection. However, existing benchmarks primarily focus on isolated, single-vulnerability samples or function-level classification, failing to reflect the complexity of real-world software where multiple interacting vulnerabilities often coexist within large files. Recent studies indicate that LLMs suffer from "count bias" and "selection bias" in multi-label tasks, yet this has not been rigorously quantified in the domain of code security. In this work, we introduce a comprehensive benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by systematically injecting controlled counts of vulnerabilities (1, 3, 5, and 9) into long-context code samples (7.5k-10k tokens) sourced from CodeParrot. We evaluate five state-of-the-art LLMs, including GPT-4o-mini, Llama-3.3-70B, and the Qwen-2.5 series. Our results reveal a sharp degradation in performance as vulnerability density increases. While Llama-3.3-70B achieves near-perfect F1 scores (approximately 0.97) on single-vulnerability C tasks, performance drops by up to 40% in high-density settings. Notably, Python and JavaScript show distinct failure modes compared to C/C++, with models exhibiting severe "under-counting" (Recall dropping to less than 0.30) in complex Python files.
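The abstract evaluates models with set-based precision, recall, and F1 over multiple findings per file. A minimal sketch of how such multi-label scoring could work, assuming (CWE, line-number) pairs as the finding representation (the paper's exact matching criterion is not specified here):

```python
# Hypothetical sketch: set-based precision/recall/F1 for multi-vulnerability
# detection. A finding is a (CWE id, line number) pair; a predicted finding
# counts as a true positive only if it exactly matches a ground-truth pair.
def prf1(predicted, ground_truth):
    """Return (precision, recall, f1) over sets of findings."""
    pred, gold = set(predicted), set(ground_truth)
    tp = len(pred & gold)  # exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: the model reports 3 findings but the file holds 5 injected bugs,
# so recall drops well below precision -- the "under-counting" failure mode.
pred = [("CWE-787", 120), ("CWE-89", 240), ("CWE-79", 300)]
gold = [("CWE-787", 120), ("CWE-89", 240), ("CWE-79", 310),
        ("CWE-416", 55), ("CWE-22", 410)]
p, r, f = prf1(pred, gold)  # precision 0.67, recall 0.40
```

Under this scoring, "under-counting" shows up directly as recall falling while precision stays high, which matches the Python/JavaScript failure mode the abstract reports.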
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs for detecting multiple vulnerabilities in large code files
Addresses performance degradation with increasing vulnerability density
Compares failure modes across C, C++, Python, and JavaScript languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces multi-vulnerability detection benchmark across four programming languages
Systematically injects controlled vulnerability counts into long-context code samples
Quantifies count and selection biases via density-controlled evaluation (1, 3, 5, and 9 vulnerabilities per file)
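The density-controlled injection above can be sketched as follows. This is a toy illustration, not the paper's actual pipeline: the `VULN_TEMPLATES` pool, the splice-at-random-lines strategy, and the `inject` helper are all assumptions for the sake of the example.

```python
import random

# Toy stand-ins for a pool of vulnerable snippets keyed by CWE id;
# the paper's real templates and injection rules are not given here.
VULN_TEMPLATES = {
    "CWE-787": "buf = bytearray(8); buf[9] = 0  # out-of-bounds write\n",
    "CWE-89":  "cur.execute('SELECT * FROM t WHERE id=' + uid)  # SQLi\n",
    "CWE-79":  "html = '<div>' + user_input + '</div>'  # XSS\n",
}

def inject(clean_lines, density, seed=0):
    """Splice `density` vulnerable lines into a clean file at random
    positions; return (mutated lines, ground-truth (CWE, 1-based line))."""
    rng = random.Random(seed)
    picks = sorted(rng.randrange(len(clean_lines) + 1)
                   for _ in range(density))
    cwes = rng.choices(list(VULN_TEMPLATES), k=density)
    out, labels, j = [], [], 0
    # Walk the clean file once, inserting templates at their picked slots
    # so the recorded line numbers refer to the final mutated file.
    for i, line in enumerate(clean_lines + [None]):  # sentinel for EOF slot
        while j < density and picks[j] == i:
            labels.append((cwes[j], len(out) + 1))
            out.append(VULN_TEMPLATES[cwes[j]])
            j += 1
        if line is not None:
            out.append(line)
    return out, labels
```

Because the injected count and positions are known exactly, the labels double as ground truth for the multi-label scoring, which is what makes density a controllable experimental variable (1, 3, 5, 9).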