AI Summary
Existing Just-In-Time Vulnerability Prediction (JIT-VP) evaluations are severely unrealistic, relying solely on vulnerability-introducing or -fixing commits while ignoring the vast majority of vulnerability-neutral commits, leading to substantial performance overestimation. Method: We propose a more realistic evaluation paradigm and introduce the first large-scale, publicly available dataset comprising over one million commits, explicitly mixing vulnerability-relevant and vulnerability-neutral instances. We systematically benchmark eight state-of-the-art JIT-VP models and address the severe class imbalance via customized loss functions and diverse sampling strategies. Contribution/Results: Under a realistic data distribution, the average PR-AUC of mainstream methods plummets from 0.805 to 0.016, and none of the existing imbalance-mitigation techniques meaningfully alleviates this degradation. Our work exposes the fundamental failure of current JIT-VP approaches in practical settings and establishes a more credible, representative benchmark and critical-reflection framework for future research.
Abstract
Modern software systems are increasingly complex, presenting significant challenges in quality assurance. Just-in-time vulnerability prediction (JIT-VP) is a proactive approach to identifying vulnerable commits and providing early warnings about potential security risks. However, we observe that current JIT-VP evaluations rely on an idealized setting, where the evaluation datasets are artificially balanced, consisting exclusively of vulnerability-introducing and vulnerability-fixing commits.
To address this limitation, this study assesses the effectiveness of JIT-VP techniques under a more realistic setting that includes both vulnerability-related and vulnerability-neutral commits. To enable a reliable evaluation, we introduce a large-scale public dataset comprising over one million commits from FFmpeg and the Linux kernel. Our empirical analysis of eight state-of-the-art JIT-VP techniques reveals a significant decline in predictive performance when applied to real-world conditions; for example, the average PR-AUC on Linux drops 98% from 0.805 to 0.016. This discrepancy is mainly attributed to the severe class imbalance in real-world datasets, where vulnerability-introducing commits constitute only a small fraction of all commits.
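The reported collapse to a PR-AUC of 0.016 is close to what an uninformative classifier achieves, since the expected average precision of random scoring equals the positive-class prevalence. The following is a minimal pure-Python sketch (not the paper's code; the 1% positive rate and sample size are illustrative assumptions) showing that random scores over a heavily imbalanced commit stream yield an average precision near the prevalence:

```python
import random

def average_precision(labels, scores):
    # Step-interpolated average precision: mean of precision@k
    # over the ranks k at which a positive is retrieved.
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits, ap_sum = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            ap_sum += hits / rank
    return ap_sum / max(hits, 1)

# Synthetic commit stream: 1% "vulnerability-introducing" positives,
# scored by an uninformative (random) classifier.
random.seed(0)
n, pos_rate = 20_000, 0.01
labels = [1] * int(n * pos_rate) + [0] * (n - int(n * pos_rate))
scores = [random.random() for _ in range(n)]
ap = average_precision(labels, scores)  # concentrates near the 0.01 prevalence
```

Against such a baseline, a realistic PR-AUC of 0.016 indicates the models extract very little usable signal once vulnerability-neutral commits dominate the evaluation set.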
To mitigate this issue, we explore the effectiveness of widely adopted techniques for handling dataset imbalance, including customized loss functions, oversampling, and undersampling. Surprisingly, our experimental results indicate that these techniques are ineffective in addressing the imbalance problem in JIT-VP. These findings underscore the importance of realistic evaluations of JIT-VP and the need for domain-specific techniques to address data imbalance in such scenarios.
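As one concrete instance of the customized-loss family evaluated above, a common recipe is to reweight the binary cross-entropy so that each rare positive contributes as much loss as many negatives. The sketch below is illustrative, not the paper's implementation; the function name `weighted_bce` and the weight values are assumptions for exposition:

```python
import math

def weighted_bce(y_true, p_pred, pos_weight=1.0):
    # Binary cross-entropy where each positive example's loss is
    # scaled by pos_weight, counteracting class imbalance.
    eps = 1e-7
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)
```

With roughly one vulnerability-introducing commit per hundred, setting `pos_weight` near the negative-to-positive ratio (about 99 here) equalizes the classes' expected contributions, penalizing a missed positive far more than a false alarm. The paper's finding is that even such reweighting (along with over- and undersampling) fails to recover the idealized-setting performance.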