🤖 AI Summary
This study addresses the performance instability of deep learning models in software vulnerability detection, which primarily stems from extreme class imbalance caused by the scarcity of vulnerable samples. For the first time, it systematically validates class imbalance as the core factor driving such performance fluctuations. The authors empirically evaluate prominent class-imbalance mitigation strategies—including focal loss, mean false error, class-balanced loss, and random over-sampling—across nine open-source datasets and two state-of-the-art deep learning models. Their findings reveal that focal loss improves precision most, mean false error and class-balanced loss yield better recall, and random over-sampling achieves the highest F1 score. However, no single method consistently outperforms the others across all evaluation metrics.
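To make the first remedy concrete: focal loss reshapes cross-entropy so that well-classified (mostly non-vulnerable) examples are down-weighted and the rare vulnerable class dominates the gradient. A minimal NumPy sketch of the standard binary focal loss follows; the function name and default `gamma`/`alpha` values are illustrative and taken from the original focal-loss formulation, not from this paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p : predicted probability of the positive (vulnerable) class
    y : ground-truth label (1 = vulnerable, 0 = non-vulnerable)
    The (1 - p_t)^gamma factor shrinks the loss of easy examples,
    so training focuses on the scarce, hard vulnerable samples.
    """
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-weighting term
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

With `gamma=2`, a confidently correct prediction (`p_t = 0.9`) contributes roughly three orders of magnitude less loss than a misclassified one (`p_t = 0.1`), which is why the study finds it steers models toward higher precision.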
📝 Abstract
Vulnerability detection is crucial to protect software security. Nowadays, deep learning (DL) is the most promising technique to automate this detection task, leveraging its superior ability to extract patterns and representations from extensive volumes of code. Despite its promise, DL-based vulnerability detection remains in its early stages, with model performance varying across datasets. Drawing insights from other well-explored application areas like computer vision, we conjecture that the class-imbalance issue (vulnerable code samples are extremely scarce) is at the core of this phenomenon. To validate this, we conduct a comprehensive empirical study involving nine open-source datasets and two state-of-the-art DL models. The results confirm our conjecture. We also obtain insightful findings on how existing imbalance solutions perform in vulnerability detection. It turns out that these solutions also perform differently across datasets and evaluation metrics. Specifically: 1) focal loss is more suitable for improving precision, 2) mean false error and class-balanced loss encourage recall, and 3) random over-sampling facilitates the F1-measure. However, none of them excels across all metrics. To delve deeper, we explore external influences on these solutions and offer insights for developing new ones.
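The third remedy, random over-sampling, needs no change to the loss function: minority-class examples are simply duplicated until the training set is balanced. A minimal sketch under the assumption of binary labels (1 = vulnerable); all names here are illustrative, not from the paper:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class examples until both
    classes have equal counts. Returns a new, balanced (X, y) pair."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    pairs = list(zip(X, y))
    pos = [p for p in pairs if p[1] == 1]
    neg = [p for p in pairs if p[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample minority items with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    Xb, yb = zip(*(pairs + extra))
    return list(Xb), list(yb)
```

Because duplication raises the effective weight of vulnerable samples without discarding any non-vulnerable ones, it tends to trade precision and recall more evenly, consistent with the finding that it favors the F1-measure.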