🤖 AI Summary
This work addresses the limitations of existing vulnerability detection methods, which are predominantly confined to binary classification and often produce large language model (LLM)-generated explanations misaligned with Common Weakness Enumeration (CWE) semantics, resulting in poor interpretability and insufficient fine-grained categorization. To overcome these challenges, we propose VulReaD, the first approach that integrates a security knowledge graph with contrastive reasoning to construct semantic skeletons. Leveraging a teacher LLM, our method generates CWE-aligned contrastive reasoning signals as supervision, guiding annotation-free fine-tuning of a student model via ORPO, without requiring any manual labeling. Evaluated on three real-world datasets, VulReaD improves binary classification F1 by 8–10% and achieves substantial gains in multi-class settings, with Macro-F1 and Micro-F1 scores increasing by 30% and 18%, respectively, while significantly enhancing CWE coverage and explanation consistency over state-of-the-art methods.
📝 Abstract
Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a knowledge-graph-guided approach for vulnerability reasoning and detection that moves beyond binary classification toward CWE-level reasoning. VulReaD leverages a security knowledge graph (KG) as a semantic backbone and uses a strong teacher LLM to generate CWE-consistent contrastive reasoning supervision, enabling student model training without manual annotations. Students are fine-tuned with Odds Ratio Preference Optimization (ORPO) to encourage taxonomy-aligned reasoning while suppressing unsupported explanations. Across three real-world datasets, VulReaD improves binary F1 by 8–10% and, in multi-class settings, Macro-F1 by 30% and Micro-F1 by 18% compared to state-of-the-art baselines. Results show that LLMs outperform deep learning baselines in binary detection and that KG-guided reasoning enhances CWE coverage and interpretability.
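To make the ORPO training signal mentioned above concrete, the sketch below shows the standard ORPO objective (Hong et al.'s Odds Ratio Preference Optimization): a supervised log-likelihood term on the preferred response plus an odds-ratio penalty that pushes the preferred response's odds above the dispreferred one's. In VulReaD's setting, the "chosen" response would be the teacher's CWE-aligned reasoning and the "rejected" one a CWE-inconsistent explanation; the function names and the scalar interface here are illustrative, not the paper's implementation.

```python
import math

def log_odds(logp: float) -> float:
    """Log-odds of a response: log(P / (1 - P)), computed from log P."""
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """ORPO objective for one preference pair.

    logp_chosen / logp_rejected: sequence log-probabilities (under the student
    model) of the CWE-aligned vs. misaligned explanation. `lam` weights the
    odds-ratio term against the SFT term.
    """
    # SFT term: negative log-likelihood of the preferred (taxonomy-aligned) response.
    nll = -logp_chosen
    # Odds-ratio term: -log sigmoid(log odds_chosen - log odds_rejected),
    # which shrinks as the chosen response becomes relatively more likely.
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    return nll + lam * l_or
```

For example, a student that strongly prefers the aligned explanation (P=0.9 vs. P=0.1) incurs a lower loss than one that is indifferent (P=0.5 vs. P=0.5), so gradient descent on this objective suppresses unsupported explanations while still fitting the teacher's preferred reasoning.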