From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic post-training methodology for large language models (LLMs) in vulnerability detection, a gap that has led to inefficient training and unreliable evaluation. The authors present the first comprehensive exploration of a full post-training pipeline, spanning supervised fine-tuning (SFT), off-policy preference optimization (DPO/ORPO), and online reinforcement learning (GRPO), augmented with a fine-grained root-cause-based reward mechanism and an LLM-as-a-Judge evaluation protocol. These innovations effectively mitigate hallucination and performance overestimation. Experiments show that the GRPO-trained model significantly outperforms SFT and preference-optimization baselines, as well as zero-shot state-of-the-art LLMs, achieving superior vulnerability detection performance under a more reliable evaluation framework.
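The fine-grained root-cause reward described above can be pictured as a scoring function that grants partial credit beyond the binary verdict. This is a minimal sketch under assumed labels: the field names (`cwe`, `root_cause_lines`) and the scoring weights are illustrative, not the paper's exact scheme.

```python
# Hedged sketch of a fine-grained, root-cause-based reward for RL (e.g. GRPO).
# Assumed label format: {"vulnerable": bool, "cwe": str, "root_cause_lines": set[int]}.
# Weights (0.5 / 0.25 / 0.25) are illustrative only.

def root_cause_reward(prediction: dict, ground_truth: dict) -> float:
    """Score one model rollout against the labeled vulnerability, in [0.0, 1.0]."""
    # A wrong binary verdict gets no credit at all.
    if prediction.get("vulnerable") != ground_truth["vulnerable"]:
        return 0.0
    # A correct "not vulnerable" verdict: nothing further to localize.
    if not ground_truth["vulnerable"]:
        return 1.0
    reward = 0.5  # base credit for the correct binary verdict
    # Partial credit for naming the right weakness class (CWE).
    if prediction.get("cwe") == ground_truth.get("cwe"):
        reward += 0.25
    # Partial credit for localizing the root cause (Jaccard overlap on lines).
    pred_lines = prediction.get("root_cause_lines", set())
    true_lines = ground_truth.get("root_cause_lines", set())
    if true_lines:
        overlap = len(pred_lines & true_lines) / len(pred_lines | true_lines)
        reward += 0.25 * overlap
    return reward
```

Compared with a coarse pass/fail signal, a graded reward like this gives the policy a usable gradient even when the model's explanation is only partially right, which is the credit-assignment property the summary attributes to root-cause judgments.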

📝 Abstract
The integration of LLMs into vulnerability detection (VD) has shifted the field toward interpretable and context-aware analysis. While post-training methods have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, spanning from cold-start SFT to off-policy preference optimization and on-policy RL, uncovering how data curation, stage interactions, reward mechanisms, and evaluation protocols collectively dictate the efficacy of model training and assessment. Our study identifies practical guidelines and insights: (1) SFT based on rejection sampling greatly outperforms rationalization-based supervision, which can introduce hallucinations due to ground-truth leakage. (2) While additional SFT epochs consistently benefit preference optimization, excessive SFT inhibits self-exploration during RL, ultimately limiting performance gains. (3) Coarse-grained reward signals often mislead RL, whereas fine-grained root-cause judgments ensure reliable credit assignment. Specification-based rewards offer further benefits but incur significant effort in specification generation. (4) Although filtering out extremely hard-to-detect vulnerability samples improves RL training efficiency, the resulting performance loss should be weighed in practical applications. (5) Models trained under GRPO significantly outperform those using SFT and preference optimization (i.e., DPO and ORPO), as well as a series of zero-shot SOTA LLMs, underscoring the significant potential of on-policy RL for LLM-based VD. (6) In contrast to binary matching, which tends to overestimate performance, LLM-as-a-Judge based on root-cause analysis provides a more robust evaluation protocol, although its accuracy varies across judge models with different levels of security expertise.
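The LLM-as-a-Judge protocol in point (6) replaces binary label matching with a judge model that compares the candidate's explanation against the labeled root cause. A minimal sketch, assuming a generic chat-completion callable and an illustrative one-word verdict format (neither is the paper's exact protocol):

```python
# Hedged sketch of LLM-as-a-Judge evaluation based on root-cause analysis.
# `judge_call` is a stand-in for any chat-completion client (prompt -> str);
# the prompt wording and CORRECT/INCORRECT format are assumptions.

JUDGE_TEMPLATE = """You are a security expert reviewing a vulnerability analysis.

Ground-truth root cause: {truth}
Model's analysis: {analysis}

Does the model's analysis identify the same root cause?
Answer with exactly one word: CORRECT or INCORRECT."""

def judge_prediction(analysis: str, truth: str, judge_call) -> bool:
    """Return True iff the judge deems the root-cause analysis correct."""
    prompt = JUDGE_TEMPLATE.format(truth=truth, analysis=analysis)
    verdict = judge_call(prompt).strip().upper()
    # "INCORRECT" also contains "CORRECT", so require an exact leading match.
    return verdict.startswith("CORRECT") and not verdict.startswith("INCORRECT")
```

Because the judge reads the explanation rather than only the yes/no label, a lucky guess with a wrong rationale scores as a failure, which is why this protocol avoids the overestimation that binary matching produces. As the abstract notes, the judge's own security expertise remains a source of variance.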
Problem

Research questions and friction points this paper is trying to address.

vulnerability detection
post-training
large language models
reinforcement learning
supervised fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Supervised Fine-Tuning
Vulnerability Detection
Preference Optimization
LLM-as-a-Judge