🤖 AI Summary
This work addresses catastrophic forgetting in large language models that must continuously detect emerging vulnerabilities in temporally evolving code repositories. To mitigate this issue, the authors propose Hybrid-CASR, a continual fine-tuning approach applied to the Phi-2 model using bi-monthly windows over a CVE-linked dataset spanning 2018–2024. The method integrates a selective replay mechanism that combines confidence-based and class-balanced criteria, prioritizing uncertain samples while preserving a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. Experimental results demonstrate that Hybrid-CASR achieves a Macro-F1 score of 0.667 in forward evaluation, a statistically significant improvement of +0.016 over the window-only baseline (p = 0.026), attains an Intransigent Backward Retention (IBR@1) score of 0.741, and reduces training time per window by about 17%, effectively balancing accuracy, robustness, and computational efficiency.
📝 Abstract
Recent work applies Large Language Models (LLMs) to source-code vulnerability detection, but most evaluations still rely on random train-test splits that ignore time and overestimate real-world performance. In practice, detectors are deployed on evolving code bases and must recognise future vulnerabilities under temporal distribution shift. This paper investigates continual fine-tuning of a decoder-style language model (microsoft/phi-2 with LoRA) on a CVE-linked dataset spanning 2018-2024, organised into bi-monthly windows. We evaluate eight continual learning strategies, including window-only and cumulative training, replay-based baselines and regularisation-based variants. We propose Hybrid Class-Aware Selective Replay (Hybrid-CASR), a confidence-aware replay method for binary vulnerability classification that prioritises uncertain samples while maintaining a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. On bi-monthly forward evaluation Hybrid-CASR achieves a Macro-F1 of 0.667, improving on the window-only baseline (0.651) by 0.016 with statistically significant gains ($p = 0.026$) and stronger backward retention (IBR@1 of 0.741). Hybrid-CASR also reduces training time per window by about 17 percent compared to the baseline, whereas cumulative training delivers only a minor F1 increase (0.661) at a 15.9-fold computational cost. Overall, the results show that selective replay with class balancing offers a practical accuracy-efficiency trade-off for LLM-based temporal vulnerability detection under continuous temporal drift.
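The replay selection described above can be sketched in a few lines. This is a minimal illustrative implementation, not the authors' code: it assumes each candidate sample carries a model confidence score (the probability assigned to its predicted label), treats low confidence as high uncertainty, and splits the buffer evenly between the two classes as one simple way to keep VULNERABLE and FIXED functions balanced. All names are hypothetical.

```python
def select_replay_buffer(samples, buffer_size):
    """Illustrative sketch of class-aware selective replay (not the paper's code).

    samples: list of dicts with keys
        'label'      -- 'VULNERABLE' or 'FIXED'
        'confidence' -- model probability for its predicted label (0..1)
    Returns up to buffer_size samples: the most uncertain examples of each
    class, with buffer slots split evenly between the two classes.
    """
    per_class = buffer_size // 2
    buffer = []
    for label in ("VULNERABLE", "FIXED"):
        pool = [s for s in samples if s["label"] == label]
        # Lower confidence => more uncertain => higher replay priority.
        pool.sort(key=lambda s: s["confidence"])
        buffer.extend(pool[:per_class])
    return buffer


# Toy usage: two uncertain and two confident samples, buffer of size 2.
samples = [
    {"label": "VULNERABLE", "confidence": 0.55},
    {"label": "VULNERABLE", "confidence": 0.99},
    {"label": "FIXED", "confidence": 0.60},
    {"label": "FIXED", "confidence": 0.97},
]
buf = select_replay_buffer(samples, buffer_size=2)
# The buffer keeps one uncertain sample per class (0.55 and 0.60).
```

In the full method the buffer is rebuilt per bi-monthly window and mixed into the next window's fine-tuning data; a production version would also handle odd buffer sizes and classes with too few uncertain samples.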