🤖 AI Summary
This work addresses catastrophic forgetting in large language models that must continuously detect emerging vulnerabilities in temporally evolving code repositories. To mitigate this issue, the authors propose Hybrid-CASR, a continual fine-tuning approach applied to the Phi-2 model using bi-monthly windows over a CVE-linked dataset spanning 2018–2024. The method integrates a selective replay mechanism that combines confidence-based and class-balanced criteria, prioritizing uncertain samples while preserving a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. Experimental results demonstrate that Hybrid-CASR achieves a Macro-F1 score of 0.667 in forward evaluation, a statistically significant improvement of +0.016 over the window-only baseline (p = 0.026), attains an Intransigent Backward Retention (IBR@1) score of 0.741, and reduces training time per window by about 17%, effectively balancing accuracy, robustness, and computational efficiency.
📝 Abstract
Recent work applies Large Language Models (LLMs) to source-code vulnerability detection, but most evaluations still rely on random train-test splits that ignore time and overestimate real-world performance. In practice, detectors are deployed on evolving code bases and must recognise future vulnerabilities under temporal distribution shift. This paper investigates continual fine-tuning of a decoder-style language model (microsoft/phi-2 with LoRA) on a CVE-linked dataset spanning 2018-2024, organised into bi-monthly windows. We evaluate eight continual learning strategies, including window-only and cumulative training, replay-based baselines and regularisation-based variants. We propose Hybrid Class-Aware Selective Replay (Hybrid-CASR), a confidence-aware replay method for binary vulnerability classification that prioritises uncertain samples while maintaining a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. On bi-monthly forward evaluation Hybrid-CASR achieves a Macro-F1 of 0.667, improving on the window-only baseline (0.651) by 0.016 with statistically significant gains ($p = 0.026$) and stronger backward retention (IBR@1 of 0.741). Hybrid-CASR also reduces training time per window by about 17 percent compared to the baseline, whereas cumulative training delivers only a minor F1 increase (0.661) at a 15.9-fold computational cost. Overall, the results show that selective replay with class balancing offers a practical accuracy-efficiency trade-off for LLM-based temporal vulnerability detection under continuous temporal drift.
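The replay selection described above can be sketched in a few lines. This is a minimal illustrative implementation, not the authors' code: it assumes each candidate sample carries a model confidence score (the probability assigned to its predicted label), treats low confidence as high uncertainty, and splits the buffer evenly between the two classes as one simple way to keep VULNERABLE and FIXED functions balanced. All names are hypothetical.

```python
def select_replay_buffer(samples, buffer_size):
    """Illustrative sketch of class-aware selective replay (not the paper's code).

    samples: list of dicts with keys
        'label'      -- 'VULNERABLE' or 'FIXED'
        'confidence' -- model probability for its predicted label (0..1)
    Returns up to buffer_size samples: the most uncertain examples of each
    class, with buffer slots split evenly between the two classes.
    """
    per_class = buffer_size // 2
    buffer = []
    for label in ("VULNERABLE", "FIXED"):
        pool = [s for s in samples if s["label"] == label]
        # Lower confidence => more uncertain => higher replay priority.
        pool.sort(key=lambda s: s["confidence"])
        buffer.extend(pool[:per_class])
    return buffer


# Toy usage: two uncertain and two confident samples, buffer of size 2.
samples = [
    {"label": "VULNERABLE", "confidence": 0.55},
    {"label": "VULNERABLE", "confidence": 0.99},
    {"label": "FIXED", "confidence": 0.60},
    {"label": "FIXED", "confidence": 0.97},
]
buf = select_replay_buffer(samples, buffer_size=2)
# The buffer keeps one uncertain sample per class (0.55 and 0.60).
```

In the full method the buffer is rebuilt per bi-monthly window and mixed into the next window's fine-tuning data; a production version would also handle odd buffer sizes and classes with too few uncertain samples.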