VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vulnerability detection methods, both traditional machine learning and large language model (LLM)-based approaches, are constrained by static inputs, fixed preference priors, and function-level benchmarks, limiting their ability to model repository-scale dependencies and critical contextual information. To address this, the authors propose VULPO, an on-policy reinforcement learning framework that dynamically explores cross-file, repository-level dependencies for context-sensitive vulnerability identification. Key contributions: (1) ContextVul, a new benchmark dataset that augments function-level samples with repository-level context to enable context-aware training and evaluation; (2) a multi-dimensional reward that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of the vulnerability analysis; and (3) label-level and sample-level difficulty-adaptive reward scaling to handle the asymmetric difficulty of vulnerability cases. Experiments show that VULPO-4B improves F1 by 85% over Qwen3-4B and matches the performance of DeepSeek-R1-0528, a model 150× larger, achieving state-of-the-art results across mainstream benchmarks.

📝 Abstract
The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether traditional machine learning-based or LLM-based approaches like prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: They depend on fixed inputs or static preference datasets, cannot adaptively explore repository-level dependencies, and are constrained by function-level benchmarks that overlook critical vulnerability context. This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with a lightweight method for extracting repository-level context information. We then design a multi-dimensional reward structure that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of vulnerability analysis, thereby guiding the model toward comprehensive contextual reasoning. To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling, encouraging the model to explore challenging cases while maintaining a balanced reward distribution. Extensive experiments demonstrate the superiority of our VULPO framework in context-aware VD: Our VULPO-4B substantially outperforms existing VD baselines based on prompt engineering and off-policy optimization, improving F1 by 85% over Qwen3-4B and achieving performance comparable to a 150× larger model, DeepSeek-R1-0528.
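The abstract describes a reward that jointly scores prediction correctness, localization accuracy, and the semantic relevance of the generated analysis. A minimal sketch of one plausible aggregation is below; the weights, the linear combination, and the function name are all illustrative assumptions, not the paper's actual formula.

```python
def multi_dimensional_reward(prediction_correct: bool,
                             localization_score: float,
                             semantic_relevance: float,
                             w_pred: float = 0.5,
                             w_loc: float = 0.3,
                             w_sem: float = 0.2) -> float:
    """Combine the three reward dimensions named in the abstract.

    localization_score and semantic_relevance are assumed to be
    normalized to [0, 1] (e.g., an overlap score for the flagged
    lines and an embedding similarity for the analysis text).
    The weights and the weighted sum are hypothetical.
    """
    r_pred = 1.0 if prediction_correct else 0.0
    return w_pred * r_pred + w_loc * localization_score + w_sem * semantic_relevance
```

With this shape, a rollout that predicts the wrong label but localizes plausibly still earns partial credit, which is one way a reward can guide the policy toward contextual reasoning rather than label guessing alone.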
Problem

Research questions and friction points this paper is trying to address.

Existing VD methods depend on fixed inputs and function-level benchmarks that overlook critical vulnerability context
Static pipelines and off-policy preference data cannot adaptively explore repository-level dependencies
Vulnerability cases vary sharply in difficulty, and uniform rewards invite reward hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy LLM reinforcement learning for vulnerability detection
Multi-dimensional reward structuring for contextual reasoning
Difficulty-adaptive reward scaling for challenging vulnerability cases
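The difficulty-adaptive scaling named above can be sketched as follows: rewards are amplified for label classes and individual samples the policy currently solves rarely, so training effort shifts toward hard cases. The inverse-success-rate scaling factors here are an assumption for illustration; the paper's exact scaling functions are not given in this summary.

```python
def difficulty_adaptive_scale(base_reward: float,
                              label_success_rate: float,
                              sample_success_rate: float) -> float:
    """Scale a base reward by label- and sample-level difficulty.

    Success rates are the fraction of recent rollouts solved correctly
    for the sample's label class and for the sample itself, in [0, 1].
    The (2 - rate) scaling is a hypothetical choice: a never-solved
    case is weighted up to 2x per level, an always-solved case is
    left unscaled.
    """
    label_scale = 2.0 - label_success_rate    # harder label class => larger scale
    sample_scale = 2.0 - sample_success_rate  # harder sample => larger scale
    return base_reward * label_scale * sample_scale
```

Keeping the scale bounded (here, at most 4x) is one way to preserve a balanced reward distribution while still steering exploration toward challenging cases.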