Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vulnerability detection benchmarks (e.g., Devign, BigVul) operate at the function level and lack support for realistic, repository-scale interprocedural analysis. Method: We propose JitVul, a just-in-time (JIT) vulnerability detection benchmark at CVE granularity covering 879 CVEs across 91 vulnerability types. It maps each function to its vulnerability-introducing and vulnerability-fixing commits, enabling pairwise evaluation of vulnerable and patched code with interprocedural, call-chain-aware context. We design a ReAct-based agent that combines Thought-Action-Observation reasoning, chain-of-thought prompting, and repository-level dependency-aware context retrieval. Results: Experiments show the agent outperforms standalone LLMs at distinguishing vulnerable from benign code. However, persistent false positives (e.g., over-analyzing benign security checks as vulnerabilities) and false negatives highlight practical challenges for real-world deployment.
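The Thought-Action-Observation loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `query_llm`, `ACTIONS`, `fetch_callees`, and `REPO_CALL_GRAPH` are all assumed names standing in for the agent's LLM backend and its repository-level context-retrieval tools.

```python
# Hypothetical call graph: function name -> list of callee source snippets.
# Stands in for repository-level dependency-aware context retrieval.
REPO_CALL_GRAPH = {}

def fetch_callees(function_name):
    # Illustrative tool: return the source of functions called by `function_name`.
    return REPO_CALL_GRAPH.get(function_name, [])

ACTIONS = {"get_callees": fetch_callees}

def react_detect(target_function_src, query_llm, max_steps=5):
    """Run a ReAct-style loop until the agent commits to a verdict."""
    context = [f"Analyze for vulnerabilities:\n{target_function_src}"]
    for _ in range(max_steps):
        # The LLM produces a Thought plus either an Action or a final verdict.
        reply = query_llm("\n".join(context))
        context.append(reply)
        if reply.startswith("FINAL:"):           # agent commits to a verdict
            return reply.removeprefix("FINAL:").strip()
        if reply.startswith("ACTION:"):          # e.g. "ACTION: get_callees parse"
            _, name, arg = reply.split(maxsplit=2)
            observation = ACTIONS[name](arg)     # execute the requested tool
            context.append(f"OBSERVATION: {observation}")
    return "undecided"
```

In this sketch the observation from each tool call is appended to the prompt, so later reasoning steps can condition on the retrieved interprocedural context.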

📝 Abstract
Large Language Models (LLMs) have shown promise in software vulnerability detection, particularly on function-level benchmarks like Devign and BigVul. However, real-world detection requires interprocedural analysis, as vulnerabilities often emerge through multi-hop function calls rather than isolated functions. While repository-level benchmarks like ReposVul and VulEval introduce interprocedural context, they remain computationally expensive, lack pairwise evaluation of vulnerability fixes, and explore limited context retrieval, limiting their practicality. We introduce JitVul, a JIT vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. Built from 879 CVEs spanning 91 vulnerability types, JitVul enables comprehensive evaluation of detection capabilities. Our results show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code. While prompting strategies like Chain-of-Thought help LLMs, ReAct Agents require further refinement. Both methods show inconsistencies, either misidentifying vulnerabilities or over-analyzing security guards, indicating significant room for improvement.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs in real-world vulnerability detection across code repositories.
Address limitations of existing benchmarks in interprocedural analysis and practicality.
Introduce JitVul for comprehensive vulnerability detection and fix evaluation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

JitVul benchmark links functions to vulnerability commits
ReAct Agents outperform LLMs in vulnerability detection
Interprocedural context improves practical vulnerability detection
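The interprocedural context the agent retrieves can be approximated as a bounded multi-hop walk over the repository's call graph. A minimal sketch, assuming the call graph is already extracted as a name-to-callees mapping (the function name and parameters here are illustrative, not the paper's actual retrieval code):

```python
from collections import deque

def multi_hop_context(call_graph, entry, max_hops=2):
    """Collect callees reachable from `entry` within `max_hops` calls,
    approximating the interprocedural context retrieved for detection.
    `call_graph` maps a function name to its direct callees (assumed input)."""
    seen = {entry}
    order = []                       # callees in breadth-first discovery order
    frontier = deque([(entry, 0)])
    while frontier:
        fn, depth = frontier.popleft()
        if depth == max_hops:        # stop expanding beyond the hop budget
            continue
        for callee in call_graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                order.append(callee)
                frontier.append((callee, depth + 1))
    return order
```

Bounding the hop count keeps retrieval cheap on large repositories while still surfacing the multi-hop call chains through which vulnerabilities often emerge.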