🤖 AI Summary
This study addresses the significant performance degradation of existing security patch detection models when applied to real-world scenarios, where they struggle to identify patches for undisclosed vulnerabilities. The authors systematically evaluate detectors trained on National Vulnerability Database (NVD) data and reveal substantial discrepancies between NVD and real-world patches in terms of commit messages, vulnerability types, and code change distributions, highlighting the limitations of relying solely on NVD for training. To mitigate this issue, they propose a hybrid augmentation strategy that integrates a small set of manually labeled real-world patches with NVD data. Experimental results demonstrate that models trained exclusively on NVD data suffer F1 score drops of up to 90% on real-world data, whereas incorporating even limited manually annotated patches substantially enhances both practical utility and generalization capability.
📝 Abstract
Attackers can exploit zero-day or one-day vulnerabilities that are not publicly disclosed. To detect these vulnerabilities, security researchers monitor development activities in open-source repositories to identify unreported security patches. The sheer volume of commits makes this task infeasible to accomplish manually. Consequently, security patch detectors are commonly trained and evaluated on security patches linked from vulnerability reports in the National Vulnerability Database (NVD). In this study, we assess the effectiveness of these detectors when applied in the wild. Our results show that models trained on NVD-derived data exhibit substantially decreased performance, with drops in F1-score of up to 90% when tested on in-the-wild security patches, rendering them impractical for real-world use. An analysis comparing security patches identified in the wild with commits linked from NVD reveals that the two can be easily distinguished: security patches associated with NVD have a different distribution of commit messages, vulnerability types, and composition of changes. These differences suggest that NVD may be unsuitable as the *sole* source of training data for models that detect security patches. We find that constructing a dataset that combines security patches from NVD with a small subset of manually identified security patches can improve model robustness.