Towards Effective Complementary Security Analysis using Large Language Models

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high false positive rate of Static Application Security Testing (SAST) tools and the associated cost of manual verification, this paper proposes an LLM-based method for automatically identifying and filtering false positives. The work presents a systematic evaluation of large language models for detecting SAST false positives while preserving all true positives, and introduces a multi-model collaborative detection framework that integrates Chain-of-Thought and Self-Consistency prompting strategies. Experiments are conducted on the OWASP Benchmark v1.2 and a real-world, multi-language, multi-tool project dataset. Results show that a single LLM identifies 62.5% of false positives on the OWASP Benchmark, rising to 78.9% with multi-model fusion; on the real-world data, the rates are 33.85% and 38.46%, respectively, without missing any genuine weaknesses. This substantially reduces the manual effort spent on false alarms in security analysis.
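The Self-Consistency strategy mentioned above can be sketched as sampling several Chain-of-Thought answers for the same finding and taking a majority vote. The paper does not publish its prompts or parsing logic, so the prompt text, the `FP`/`TP` answer convention, and the `ask_llm` callback below are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter
from typing import Callable, List


def self_consistency_verdict(
    ask_llm: Callable[[str], str],
    finding: str,
    n_samples: int = 5,
) -> str:
    """Classify a SAST finding as false positive ("FP") or true positive
    ("TP") by majority vote over several sampled reasoning chains.

    `ask_llm` is a hypothetical callback that sends a prompt to an LLM
    (sampled with temperature > 0 so the chains differ) and returns the
    model's text answer.
    """
    # Chain-of-Thought style prompt; the exact wording is an assumption.
    prompt = (
        "Think step by step about whether the following SAST finding is a "
        "false positive. End your answer with 'FP' or 'TP'.\n\n" + finding
    )
    verdicts: List[str] = []
    for _ in range(n_samples):
        answer = ask_llm(prompt)
        # Naive answer parsing: trust the trailing token of the reply.
        verdicts.append("FP" if answer.strip().endswith("FP") else "TP")
    # Majority vote across the sampled chains.
    return Counter(verdicts).most_common(1)[0][0]
```

With an odd `n_samples`, a single hallucinated chain cannot flip the verdict on its own, which is the usual motivation for Self-Consistency over a single sample.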

📝 Abstract
A key challenge in security analysis is the manual evaluation of potential security weaknesses generated by static application security testing (SAST) tools. Numerous false positives (FPs) in these reports reduce the effectiveness of security analysis. We propose using Large Language Models (LLMs) to improve the assessment of SAST findings. We investigate the ability of LLMs to reduce FPs while trying to maintain a perfect true positive rate, using datasets extracted from the OWASP Benchmark (v1.2) and a real-world software project. Our results indicate that advanced prompting techniques, such as Chain-of-Thought and Self-Consistency, substantially improve FP detection. Notably, some LLMs identified approximately 62.5% of FPs in the OWASP Benchmark dataset without missing genuine weaknesses. Combining detections from different LLMs would increase this FP detection to approximately 78.9%. Additionally, we demonstrate our approach's generalizability using a real-world dataset covering five SAST tools, three programming languages, and infrastructure files. The best LLM detected 33.85% of all FPs without missing genuine weaknesses, while combining detections from different LLMs would increase this detection to 38.46%. Our findings highlight the potential of LLMs to complement traditional SAST tools, enhancing automation and reducing resources spent addressing false alarms.
Problem

Research questions and friction points this paper is trying to address.

Reducing false positives in SAST reports using LLMs
Maintaining true positive rate while improving FP detection
Enhancing security analysis automation with advanced LLM techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs reduce false positives in SAST reports
Advanced prompting techniques enhance FP detection
Combining multiple LLMs improves detection accuracy
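One plausible reading of "combining detections from different LLMs" is a union of the per-model false-positive flags: a finding is filtered if any model marks it as a false positive. If each model on its own never flags a genuine weakness, the union preserves that property while catching more false positives, which matches the reported jump from 62.5% to 78.9%. This is a sketch under that assumption, not the paper's published fusion rule:

```python
def combine_fp_flags(per_model_fp_ids: list[set[str]]) -> set[str]:
    """Union of per-model false-positive verdicts.

    Each element of `per_model_fp_ids` is the set of finding IDs one
    model classified as false positives. A finding is filtered when at
    least one model flags it; if every model individually preserves all
    true positives, so does the combination.
    """
    combined: set[str] = set()
    for fp_ids in per_model_fp_ids:
        combined |= fp_ids
    return combined
```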
Jonas Wagner
Data Science Group, Paderborn University, Paderborn, Germany
Simon Müller
XITASO GmbH IT & Software Solutions, Augsburg, Germany
Christian Näther
XITASO GmbH IT & Software Solutions, Augsburg, Germany
Jan-Philipp Steghöfer
Web & Software Engineering Research Group, Leipzig University of Applied Sciences, Leipzig, Germany
Andreas Both
Professor at Leipzig University of Applied Sciences & Head of Research at DATEV eG
Web Engineering · AI · Software Engineering · Question Answering · Data Science