From LLMs to Agents: A Comparative Evaluation of LLMs and LLM-based Agents in Security Patch Detection

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Silent security patches in open-source software (OSS) are frequently overlooked, exacerbating supply-chain security risks. This work presents the first systematic evaluation of large language models (LLMs) and LLM-based agents for detecting such patches, analyzing performance across three dimensions: vulnerability types, prompting strategies, and context window sizes. We propose and compare three approaches: (1) a baseline LLM, (2) a data-augmented LLM, and (3) a ReAct-style reasoning agent. Results show that the data-augmented LLM achieves the best overall performance, while the ReAct agent significantly reduces false positives (by 32.7% relative to the baseline) without compromising accuracy. Our methodology integrates prompt engineering, context window analysis, and cross-model validation using both open-source and commercial LLMs. Findings demonstrate that combining data augmentation with structured reasoning substantially improves detection robustness. This study establishes a practical, empirically grounded paradigm for identifying silent security patches in OSS ecosystems.

📝 Abstract
The widespread adoption of open-source software (OSS) has accelerated software innovation but also increased security risks due to the rapid propagation of vulnerabilities and silent patch releases. In recent years, large language models (LLMs) and LLM-based agents have demonstrated remarkable capabilities in various software engineering (SE) tasks, enabling them to effectively address software security challenges such as vulnerability detection. However, systematic evaluation of the capabilities of LLMs and LLM-based agents in security patch detection remains limited. To bridge this gap, we conduct a comprehensive evaluation of the performance of LLMs and LLM-based agents for security patch detection. Specifically, we investigate three methods: Plain LLM (a single LLM with a system prompt), Data-Aug LLM (data augmentation based on the Plain LLM), and the ReAct Agent (leveraging the thought-action-observation mechanism). We also evaluate the performance of both commercial and open-source LLMs under these methods and compare these results with those of existing baselines. Furthermore, we analyze the detection performance of these methods across various vulnerability types, and examine the impact of different prompting strategies and context window sizes on the results. Our findings reveal that the Data-Aug LLM achieves the best overall performance, whereas the ReAct Agent demonstrates the lowest false positive rate (FPR). Although baseline methods exhibit strong accuracy, their false positive rates are significantly higher. In contrast, our evaluated methods achieve comparable accuracy while substantially reducing the FPR. These findings provide valuable insights into the practical applications of LLMs and LLM-based agents in security patch detection, highlighting their advantage in maintaining robust performance while minimizing false positive rates.
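The ReAct Agent evaluated in the abstract relies on a thought-action-observation loop: the model reasons about a commit, invokes a tool, observes the result, and repeats until it commits to a label. A minimal sketch of that loop is shown below; the paper's actual agent, prompts, and tools are not reproduced here, so `mock_llm`, `inspect_diff`, and the action names are hypothetical stand-ins.

```python
def mock_llm(prompt: str) -> str:
    """Hypothetical LLM stub: returns a scripted thought/action trace."""
    if "Observation: fixes buffer overflow" in prompt:
        return ("Thought: the diff adds a bounds check before a memcpy.\n"
                "Action: finish[security_patch]")
    return ("Thought: I should inspect the commit diff first.\n"
            "Action: inspect_diff[commit_abc]")

def inspect_diff(commit: str) -> str:
    """Hypothetical tool: returns a one-line summary of a commit diff."""
    return "fixes buffer overflow"

def react_classify(commit: str, max_steps: int = 5) -> str:
    """ReAct-style loop: alternate thought/action with tool observations
    until the model emits finish[<label>] or the step budget runs out."""
    trace = f"Task: classify commit {commit} as security_patch or other.\n"
    for _ in range(max_steps):
        reply = mock_llm(trace)
        trace += reply + "\n"
        action = reply.rsplit("Action: ", 1)[-1].strip()
        if action.startswith("finish["):
            return action[len("finish["):-1]  # extract the final label
        if action.startswith("inspect_diff["):
            obs = inspect_diff(action[len("inspect_diff["):-1])
            trace += f"Observation: {obs}\n"  # feed the result back in
    return "other"  # conservative fallback: no decision within budget

print(react_classify("commit_abc"))  # → security_patch
```

In this toy run the agent takes one tool step before deciding; the abstract's finding that this structure yields the lowest false positive rate is plausible precisely because the agent can gather evidence before labeling a commit.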
Problem

Research questions and friction points this paper is trying to address.

Assessing the capabilities of LLMs and LLM-based agents for security patch detection
Comparing performance across methods, including the Plain LLM, Data-Aug LLM, and ReAct Agent
Analyzing detection accuracy and false positive rates across vulnerability types
Innovation

Methods, ideas, or system contributions that make the work stand out.

The Data-Aug LLM achieves the best overall detection performance
The ReAct Agent yields the lowest false positive rate of all evaluated methods
Both methods match baseline accuracy while substantially reducing false positives