🤖 AI Summary
To address the class imbalance problem in C/C++ vulnerability detection, which drives high false-positive rates, and the limited interpretability of existing methods, this paper proposes an edge-aware graph attention model based on Code Property Graphs (CPGs). Methodologically: (i) we construct a CPG integrating syntactic, control-flow, and data-flow information; (ii) we design a dual-channel node embedding (structural + semantic) coupled with an edge-type-aware attention mechanism to enhance relational modeling; and (iii) we adopt a class-weighted cross-entropy loss to mitigate class imbalance and incorporate critical code region localization to improve interpretability. Evaluated on the ReVeal dataset, the model achieves 88.25% accuracy and a 48.23% F1 score, relative improvements of 4.6% and 16.9% over the ReVeal baseline, and significantly surpasses mainstream static analysis tools.
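The summary does not give the exact attention formulation, but the core idea of edge-type-aware attention can be sketched in a few lines: the attention logit for an edge depends not only on the two node embeddings but also on a learned embedding of the edge's type, so AST, CFG, and DFG relations in the CPG are weighted differently. The dimensions, the `tanh` nonlinearity, and all variable names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8              # node embedding dimension (illustrative)
n_edge_types = 3   # e.g. AST / CFG / DFG edges in a CPG (assumed)

H = rng.normal(size=(5, d))              # dual-channel node embeddings (stand-in)
E = rng.normal(size=(n_edge_types, d))   # learnable edge-type embeddings
a = rng.normal(size=3 * d)               # attention vector over [h_i, h_j, e_ij]

def attn_logit(i, j, etype):
    # GAT-style score, extended so it also depends on the edge type:
    # concatenate source node, target node, and edge-type embeddings.
    z = np.concatenate([H[i], H[j], E[etype]])
    return np.tanh(a @ z)  # a GAT would use LeakyReLU; tanh keeps this simple

def attn_weights(i, neighbors_with_types):
    # softmax over node i's incident edges -> per-edge attention weights
    logits = np.array([attn_logit(i, j, t) for j, t in neighbors_with_types])
    ex = np.exp(logits - logits.max())
    return ex / ex.sum()
```

Because the edge-type embedding enters the logit, the same pair of nodes receives a different attention weight when connected by, say, a data-flow edge versus a syntax edge, which is the relational signal the model exploits.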
📝 Abstract
Detecting security vulnerabilities in source code remains challenging, particularly due to class imbalance in real-world datasets, where vulnerable functions are under-represented. Existing learning-based methods often optimise for recall, leading to high false positive rates and reduced usability in development workflows. Furthermore, many approaches lack explainability, limiting their adoption in security triage. This paper presents ExplainVulD, a graph-based framework for vulnerability detection in C/C++ code. The method constructs Code Property Graphs and represents nodes using dual-channel embeddings that capture both semantic and structural information. These are processed by an edge-aware attention mechanism that incorporates edge-type embeddings to distinguish among program relations. To address class imbalance, the model is trained using a class-weighted cross-entropy loss. ExplainVulD achieves a mean accuracy of 88.25 percent and an F1 score of 48.23 percent across 30 independent runs on the ReVeal dataset. These results represent relative improvements of 4.6 percent in accuracy and 16.9 percent in F1 score over the ReVeal model, a prior learning-based method. The framework also outperforms static analysis tools, with relative gains of 14.0 to 14.1 percent in accuracy and 132.2 to 201.2 percent in F1 score. Beyond improved detection performance, ExplainVulD produces explainable outputs by identifying the most influential code regions within each function, supporting transparency and trust in security triage.
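The class-weighted cross-entropy loss mentioned above can be sketched as follows. The abstract does not state the paper's exact weighting scheme, so this minimal NumPy example assumes inverse-frequency weights, a common choice: each class weight scales inversely with how often that class appears, so the rare vulnerable class contributes more per sample to the loss.

```python
import numpy as np

def class_weights(labels, n_classes=2):
    # Inverse-frequency weighting (assumed scheme, not necessarily the
    # paper's): w_c = N / (K * N_c), so rarer classes get larger weights.
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * counts)

def weighted_cross_entropy(probs, labels, weights):
    # probs: (N, C) predicted class probabilities; labels: (N,) int labels.
    eps = 1e-12  # guard against log(0)
    per_sample = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.mean(weights[labels] * per_sample)
```

Up-weighting the minority class penalises missed vulnerable functions more heavily than a plain cross-entropy would, which counteracts the imbalance that otherwise pushes the model toward predicting "not vulnerable" everywhere.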