🤖 AI Summary
Prior work on large language models (LLMs) for software vulnerability detection (SVD) is largely confined to C/C++ and to single optimization strategies, lacking systematic cross-language evaluation. Method: This paper introduces a benchmark of roughly 44,700 function-level vulnerability examples across Python, Java, and JavaScript; uniformly evaluates prompt engineering, instruction tuning, and sequence classification fine-tuning on five open-source LLMs; proposes two improvement strategies, data undersampling for class balance and multi-LLM ensembling; and benchmarks the LLMs against fine-tuned lightweight models and mainstream static application security testing (SAST) tools. Contribution/Results: Empirical findings reveal persistent challenges for LLMs in SVD; the ensemble and balancing strategies improve F1-score by 12.3%; the study establishes the first reproducible, cross-language empirical benchmark and optimization framework for AI-driven software security practice.
📝 Abstract
Recent advancements in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study of the capabilities of LLMs in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research focuses primarily on C/C++ datasets and typically explores only one or two of the following strategies for open-source LLMs: prompt engineering, instruction tuning, and sequence classification fine-tuning. Consequently, there is a significant knowledge gap regarding how effectively diverse LLMs detect vulnerabilities across programming languages. To address this gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We compiled a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using prompt engineering, instruction tuning, and sequence classification fine-tuning, and benchmark them against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues for improving LLM performance on SVD: (a) the data perspective, retraining models on downsampled, balanced datasets; and (b) the model perspective, ensemble learning methods that combine predictions from multiple LLMs. Our experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future work on leveraging generative AI to enhance software security practices.
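The two improvement avenues described above can be sketched in a few lines; this is a minimal illustration, not the paper's implementation, and the function names, the binary-label convention (1 = vulnerable, 0 = safe), and the majority-vote combiner are illustrative assumptions.

```python
import random
from collections import Counter

def undersample(examples, labels, seed=0):
    """Data perspective: randomly downsample each class to the size of the
    smallest class, yielding a balanced training set of (example, label) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_label.values())
    balanced = [(x, y) for y, xs in by_label.items()
                for x in rng.sample(xs, n_min)]
    rng.shuffle(balanced)
    return balanced

def majority_vote(predictions):
    """Model perspective: combine per-model binary predictions for one
    function (1 = vulnerable, 0 = safe) by simple majority vote."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, `undersample(funcs, labels)` would shrink a heavily non-vulnerable training set to equal class counts before fine-tuning, and `majority_vote([1, 0, 1])` would flag a function that two of three models label vulnerable. Real ensembles might instead weight models by validation F1 or average predicted probabilities.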