Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation of vulnerability detection models caused by noisy training data, erroneous labels, and severe class imbalance, this paper proposes a data-quality-centric active learning framework. The authors introduce dataset maps—a technique new to vulnerability detection—and combine them with DeepGini and K-Means sampling to quantify per-sample learning difficulty, enabling identification and pruning of "bad seed" samples. Semantic features extracted with CodeBERT are used to prioritize high-informativeness samples and automatically filter harmful ones. Evaluated with CodeBERT on the Big-Vul dataset, the method improves F1-score over random sampling by up to 45.91% and over conventional active learning approaches by up to 61.46%. The framework enhances model robustness, training stability, and efficiency while mitigating the adverse impact of low-quality data.
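DeepGini, one of the two sampling strategies named above, scores a sample's uncertainty as the Gini impurity of the model's predicted class distribution. A minimal sketch (the function name and usage are illustrative, not the paper's implementation):

```python
import numpy as np

def deepgini(probs):
    """DeepGini uncertainty score: 1 - sum_i p_i^2 over the predicted
    class probabilities. It is 0 for a one-hot (fully confident)
    prediction and peaks at 1 - 1/C for a uniform guess over C classes."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2, axis=-1)

high_uncertainty = deepgini([0.5, 0.5])   # ≈ 0.5 (most uncertain binary case)
low_uncertainty = deepgini([0.95, 0.05])  # ≈ 0.095 (near-confident)
```

In an active learning loop, samples with the highest DeepGini scores are queried for labeling first.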

📝 Abstract
Vulnerability detection is crucial for identifying security weaknesses in software systems. However, the effectiveness of machine learning models in this domain is often hindered by low-quality training datasets, which contain noisy, mislabeled, or imbalanced samples. This paper proposes a novel dataset maps-empowered approach that systematically identifies and mitigates hard-to-learn outliers, referred to as "bad seeds", to improve model training efficiency. Our approach can categorize training examples based on learning difficulty and integrate this information into an active learning framework. Unlike traditional methods that focus on uncertainty-based sampling, our strategy prioritizes dataset quality by filtering out performance-harmful samples while emphasizing informative ones. Our experimental results show that our approach can improve F1 score over random selection by 45.36% (DeepGini) and 45.91% (K-Means) and outperforms standard active learning by 61.46% (DeepGini) and 32.65% (K-Means) for CodeBERT on the Big-Vul dataset, demonstrating the effectiveness of integrating dataset maps for optimizing sample selection in vulnerability detection. Furthermore, our approach also enhances model robustness, improves sample selection by filtering bad seeds, and stabilizes active learning performance across iterations. By analyzing the characteristics of these outliers, we provide insights for future improvements in dataset construction, making vulnerability detection more reliable and cost-effective.
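The dataset maps the abstract refers to characterize each training example by the mean and standard deviation of the model's confidence in its true label across training epochs; examples with consistently low confidence fall in the hard-to-learn region treated as candidate "bad seeds". A minimal sketch, where the threshold values are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def dataset_map_stats(true_label_probs):
    """true_label_probs: (n_epochs, n_samples) array of the probability the
    model assigns to each sample's true label at each epoch. Returns
    per-sample confidence (mean over epochs) and variability (std) —
    the two axes of a dataset map."""
    p = np.asarray(true_label_probs, dtype=float)
    return p.mean(axis=0), p.std(axis=0)

def flag_bad_seeds(confidence, variability, conf_max=0.2, var_max=0.1):
    """Hard-to-learn region: low confidence and low variability across
    epochs. Thresholds are illustrative, not from the paper."""
    return (confidence < conf_max) & (variability < var_max)

# Three samples tracked over three epochs; only the first stays hard.
epoch_probs = [[0.10, 0.90, 0.30],
               [0.12, 0.85, 0.70],
               [0.08, 0.95, 0.40]]
conf, var = dataset_map_stats(epoch_probs)
bad = flag_bad_seeds(conf, var)  # flags sample 0 only
```

Pruning the flagged samples before (or during) active learning is what the paper reports as improving training stability.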
Problem

Research questions and friction points this paper is trying to address.

Improving vulnerability detection by pruning low-quality training samples
Enhancing active learning via dataset maps to filter bad seeds
Boosting model robustness and F1 score in security weakness identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes bad seeds to enhance training efficiency
Categorizes examples by learning difficulty for active learning
Improves F1 score by up to 45.91% over random selection and up to 61.46% over standard active learning on Big-Vul
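The innovations above combine into a batch-selection step: cluster samples (e.g. K-Means over CodeBERT embeddings), then pick the most uncertain samples from each cluster while skipping flagged bad seeds. This is a hedged illustration of the general diversity-plus-uncertainty idea, not the paper's exact procedure; cluster labels are assumed to come from a K-Means fit run separately:

```python
import numpy as np

def select_batch(cluster_labels, gini_scores, bad_seed_mask, per_cluster=1):
    """For each cluster, pick the highest-uncertainty (DeepGini) samples
    that are not flagged as bad seeds. Returns selected sample indices."""
    cluster_labels = np.asarray(cluster_labels)
    gini_scores = np.asarray(gini_scores, dtype=float)
    bad_seed_mask = np.asarray(bad_seed_mask, dtype=bool)
    selected = []
    for c in np.unique(cluster_labels):
        idx = np.where((cluster_labels == c) & ~bad_seed_mask)[0]
        ranked = idx[np.argsort(gini_scores[idx])[::-1]]  # most uncertain first
        selected.extend(ranked[:per_cluster].tolist())
    return selected

picked = select_batch(
    cluster_labels=[0, 0, 1, 1],
    gini_scores=[0.5, 0.4, 0.3, 0.45],
    bad_seed_mask=[False, False, False, True],
)
# Cluster 0 yields index 0 (score 0.5); cluster 1 yields index 2,
# since index 3 is excluded as a bad seed.
```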