🤖 AI Summary
Prior work on large language models (LLMs) for software vulnerability detection (SVD) is largely confined to C/C++ and to single optimization strategies, lacking systematic cross-language evaluation. Method: This paper introduces a benchmark of roughly 44,700 function-level vulnerability examples across Python, Java, and JavaScript; uniformly evaluates prompt engineering, instruction tuning, and sequence classification fine-tuning on five open-source LLMs; proposes two improvement strategies, data undersampling for class balance and multi-LLM ensembling; and benchmarks the LLMs against fine-tuned lightweight models and mainstream static application security testing (SAST) tools. Contribution/Results: Empirical findings reveal persistent challenges for LLMs in SVD; the ensemble and balancing strategies improve F1-score by 12.3%; the study establishes the first reproducible, cross-language empirical benchmark and optimization framework for AI-driven software security practice.
📝 Abstract
Recent advancements in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study of the capabilities of LLMs in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research focuses primarily on C/C++ datasets and typically explores only one or two of the following strategies for open-source LLMs: prompt engineering, instruction tuning, and sequence classification fine-tuning. Consequently, there is a significant knowledge gap regarding how effectively diverse LLMs detect vulnerabilities across programming languages. To address this gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We compiled a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using prompt engineering, instruction tuning, and sequence classification fine-tuning, and benchmark them against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues for improving LLM performance on SVD: (a) the data perspective, retraining models on downsampled, balanced datasets; and (b) the model perspective, ensemble learning methods that combine predictions from multiple LLMs. Our experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future work on leveraging generative AI to enhance software security practices.
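The two improvement avenues described above can be sketched in a few lines; this is a minimal illustration, not the paper's implementation, and the function names, the binary-label convention (1 = vulnerable, 0 = safe), and the majority-vote combiner are illustrative assumptions.

```python
import random
from collections import Counter

def undersample(examples, labels, seed=0):
    """Data perspective: randomly downsample each class to the size of the
    smallest class, yielding a balanced training set of (example, label) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_label.values())
    balanced = [(x, y) for y, xs in by_label.items()
                for x in rng.sample(xs, n_min)]
    rng.shuffle(balanced)
    return balanced

def majority_vote(predictions):
    """Model perspective: combine per-model binary predictions for one
    function (1 = vulnerable, 0 = safe) by simple majority vote."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, `undersample(funcs, labels)` would shrink a heavily non-vulnerable training set to equal class counts before fine-tuning, and `majority_vote([1, 0, 1])` would flag a function that two of three models label vulnerable. Real ensembles might instead weight models by validation F1 or average predicted probabilities.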