🤖 AI Summary
Automated vulnerability-fixing commit (VFC) detection suffers from fragmented data, weak model generalization, and evaluation biases. This work builds a unified benchmark that integrates over 20 datasets and more than 180,000 commits, and uses it to systematically evaluate code language models ranging from 125M to 14B parameters across input modalities (code diffs, commit messages, and semantic context) and evaluation strategies. The study finds no evidence that models learn transferable security semantics from code changes alone. When commit messages are available, they dominate model attention, and enriching diffs with additional semantic context does not shift attention toward the code changes. Performance drops by approximately 17% under group-stratified splits compared to random partitioning, while temporal splits on aggregated datasets prove unreliable because the underlying project distribution shifts over time. At a 0.5% false-positive rate, fine-tuned code-only models miss over 93% of VFCs. Larger and more diverse training data or generative approaches bring preliminary improvements but do not resolve these limitations. The unified framework and evaluation suite are released to support future work on code-centric VFC detection.
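The difference between random and grouped partitioning can be illustrated with a minimal sketch. It assumes each commit is a dictionary with a hypothetical `repo` key naming its source project; the function name `group_split` and the toy data are illustrative and not taken from the released framework. The point is only that every commit from a given project lands on one side of the split, so a model cannot score well by recognizing project-specific patterns it saw during training.

```python
import random
from collections import defaultdict


def group_split(commits, test_fraction=0.2, seed=0):
    """Split commits so that no repository appears in both train and test.

    `commits` is a list of dicts with a hypothetical "repo" key.
    """
    by_repo = defaultdict(list)
    for commit in commits:
        by_repo[commit["repo"]].append(commit)

    # Shuffle whole repositories, not individual commits.
    repos = sorted(by_repo)
    random.Random(seed).shuffle(repos)

    target = test_fraction * len(commits)
    train, test = [], []
    for repo in repos:
        # Fill the test set with entire repositories until it reaches the
        # requested size, then send the remaining repositories to train.
        bucket = test if len(test) < target else train
        bucket.extend(by_repo[repo])
    return train, test


# Toy data: 50 commits spread over 5 hypothetical repositories.
commits = [
    {"repo": f"repo{i % 5}", "sha": f"c{i}", "is_vfc": i % 7 == 0}
    for i in range(50)
]
train, test = group_split(commits)
train_repos = {c["repo"] for c in train}
test_repos = {c["repo"] for c in test}
print(len(train), len(test), train_repos & test_repos)
```

A random split would instead shuffle individual commits, so the same repository would typically contribute to both sides.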
📝 Abstract
Automated detection of vulnerability-fixing commits (VFCs) is critical for timely security patch deployment, as advisory databases lag patch releases by a median of 25 days and many fixes never receive advisories. We present a comprehensive evaluation of VFC detection based on code language models through a unified framework consolidating over 20 fragmented datasets spanning more than 180,000 commits. Across over 180 experiments with fine-tuned models from 125M to 14B parameters, we find no evidence that models acquire transferable security-relevant code understanding from code changes alone. When commit messages are available, they dominate model attention; when they are removed, an attribution analysis shows that enriching diffs with additional intra-procedural semantic context does not shift model attention toward the code changes. Group-stratified evaluation exposes approximately 17% performance drops compared to random splits, while temporal splits on aggregated datasets prove unreliable due to compositional shift in the underlying project distributions. At a false positive rate of 0.5%, all fine-tuned code-only models miss over 93% of vulnerabilities. Larger and more diverse training data or generative approaches show preliminary improvements but do not resolve the underlying limitations. To support future research on code-centric VFC detection, we release our unified framework and evaluation suite.
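As an illustration of the "miss rate at a fixed false positive rate" figure, the following minimal sketch shows one way such a number can be computed from classifier scores. The function name `miss_rate_at_fpr`, the thresholding rule, and the toy scores are assumptions made for illustration; the paper's evaluation suite may compute the metric differently.

```python
def miss_rate_at_fpr(scores, labels, max_fpr=0.005):
    """Fraction of true VFCs missed when false positives are capped at max_fpr.

    scores: model scores, higher means "more likely a VFC".
    labels: 1 for a VFC, 0 for any other commit.
    """
    negatives = sorted(
        (s for s, y in zip(scores, labels) if y == 0), reverse=True
    )
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")

    # Number of non-VFC commits we are allowed to flag by mistake.
    allowed_fp = int(max_fpr * len(negatives))
    if allowed_fp >= len(negatives):
        threshold = float("-inf")
    else:
        # Flag only scores strictly above this negative's score, so at most
        # `allowed_fp` negatives end up above the threshold.
        threshold = negatives[allowed_fp]

    detected = sum(1 for s in positives if s > threshold)
    return 1.0 - detected / len(positives)


# Toy data: 1000 non-VFC commits and 4 VFCs with made-up scores.
neg_scores = [i / 1000 for i in range(1000)]
pos_scores = [0.9995, 0.9, 0.5, 0.2]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)
miss = miss_rate_at_fpr(scores, labels, max_fpr=0.005)
print(miss)
```

In this toy setting only one of the four VFCs scores above the strict threshold, so three of four are missed. A low false positive budget matters in practice because non-fixing commits vastly outnumber VFCs, so even a small false positive rate produces many alerts to triage.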