AI Summary
Existing Just-In-Time Vulnerability Prediction (JIT-VP) evaluations are severely unrealistic, relying solely on vulnerability-introducing or -fixing commits while ignoring the vast majority of vulnerability-neutral commits, leading to substantial performance overestimation. Method: We propose a more realistic evaluation paradigm and introduce the first large-scale, publicly available dataset comprising over one million commits, explicitly mixing vulnerability-relevant and vulnerability-neutral instances. We systematically benchmark eight state-of-the-art JIT-VP models and address the severe class imbalance via customized loss functions and diverse sampling strategies. Contribution/Results: Under a realistic data distribution, the average PR-AUC of mainstream methods plummets from 0.805 to 0.016, and none of the existing imbalance-mitigation techniques meaningfully alleviates this degradation. Our work exposes the fundamental failure of current JIT-VP approaches in practical settings and establishes a more credible, representative benchmark and critical-reflection framework for future research.
Abstract
Modern software systems are increasingly complex, presenting significant challenges in quality assurance. Just-in-time vulnerability prediction (JIT-VP) is a proactive approach to identifying vulnerable commits and providing early warnings about potential security risks. However, we observe that current JIT-VP evaluations rely on an idealized setting, where the evaluation datasets are artificially balanced, consisting exclusively of vulnerability-introducing and vulnerability-fixing commits.
To address this limitation, this study assesses the effectiveness of JIT-VP techniques under a more realistic setting that includes both vulnerability-related and vulnerability-neutral commits. To enable a reliable evaluation, we introduce a large-scale public dataset comprising over one million commits from FFmpeg and the Linux kernel. Our empirical analysis of eight state-of-the-art JIT-VP techniques reveals a significant decline in predictive performance when applied to real-world conditions; for example, the average PR-AUC on Linux drops 98% from 0.805 to 0.016. This discrepancy is mainly attributed to the severe class imbalance in real-world datasets, where vulnerability-introducing commits constitute only a small fraction of all commits.
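The reported collapse to a PR-AUC of 0.016 is close to what an uninformative classifier achieves, since the expected average precision of random scoring equals the positive-class prevalence. The following is a minimal pure-Python sketch (not the paper's code; the 1% positive rate and sample size are illustrative assumptions) showing that random scores over a heavily imbalanced commit stream yield an average precision near the prevalence:

```python
import random

def average_precision(labels, scores):
    # Step-interpolated average precision: mean of precision@k
    # over the ranks k at which a positive is retrieved.
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits, ap_sum = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            ap_sum += hits / rank
    return ap_sum / max(hits, 1)

# Synthetic commit stream: 1% "vulnerability-introducing" positives,
# scored by an uninformative (random) classifier.
random.seed(0)
n, pos_rate = 20_000, 0.01
labels = [1] * int(n * pos_rate) + [0] * (n - int(n * pos_rate))
scores = [random.random() for _ in range(n)]
ap = average_precision(labels, scores)  # concentrates near the 0.01 prevalence
```

Against such a baseline, a realistic PR-AUC of 0.016 indicates the models extract very little usable signal once vulnerability-neutral commits dominate the evaluation set.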
To mitigate this issue, we explore the effectiveness of widely adopted techniques for handling dataset imbalance, including customized loss functions, oversampling, and undersampling. Surprisingly, our experimental results indicate that these techniques are ineffective in addressing the imbalance problem in JIT-VP. These findings underscore the importance of realistic evaluations of JIT-VP and the need for domain-specific techniques to address data imbalance in such scenarios.
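As one concrete instance of the customized-loss family evaluated above, a common recipe is to reweight the binary cross-entropy so that each rare positive contributes as much loss as many negatives. The sketch below is illustrative, not the paper's implementation; the function name `weighted_bce` and the weight values are assumptions for exposition:

```python
import math

def weighted_bce(y_true, p_pred, pos_weight=1.0):
    # Binary cross-entropy where each positive example's loss is
    # scaled by pos_weight, counteracting class imbalance.
    eps = 1e-7
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)
```

With roughly one vulnerability-introducing commit per hundred, setting `pos_weight` near the negative-to-positive ratio (about 99 here) equalizes the classes' expected contributions, penalizing a missed positive far more than a false alarm. The paper's finding is that even such reweighting (along with over- and undersampling) fails to recover the idealized-setting performance.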