🤖 AI Summary
This work investigates whether current fine-tuned large language models genuinely understand the root causes of software vulnerabilities or merely rely on superficial functional-context patterns. To address this question, the authors introduce the concept of "semantic traps," highlighting how models may conflate functional associations with true security semantics. They present TrapEval, the first evaluation framework designed to disentangle functional patterns from vulnerability semantics, featuring the V2N and V2P paired datasets derived from real-world open-source projects. Through semantic-preserving perturbations, cross-dataset testing, and CodeBLEU-based metrics, the study reveals that, despite strong performance on conventional benchmarks, mainstream models depend heavily on functional shortcuts when distinguishing vulnerable code from its patched counterparts. Their robustness degrades significantly under semantic-preserving perturbations, underscoring a critical deficiency in causal reasoning capability.
📄 Abstract
LLMs demonstrate promising performance in software vulnerability detection after fine-tuning. However, it remains unclear whether these gains reflect a genuine understanding of vulnerability root causes or merely an exploitation of functional patterns. In this paper, we identify a critical failure mode termed the "semantic trap," where fine-tuned LLMs achieve high detection scores by associating certain functional domains with vulnerability likelihood rather than reasoning about the underlying security semantics. To systematically evaluate this phenomenon, we propose TrapEval, a comprehensive evaluation framework designed to disentangle vulnerability root causes from functional patterns. TrapEval introduces two complementary datasets derived from real-world open-source projects: V2N, which pairs vulnerable code with unrelated benign code, and V2P, which pairs vulnerable code with its corresponding patched version, forcing models to distinguish near-identical code that differs only in subtle security-critical logic. Using TrapEval, we fine-tune five representative state-of-the-art LLMs across three model families and evaluate them under cross-dataset testing, semantic-preserving perturbations, and varying degrees of semantic gap measured by CodeBLEU. Our empirical results reveal that, despite improved benchmark metrics, fine-tuned LLMs consistently struggle to distinguish vulnerable code from its patched counterpart, exhibit severe robustness degradation under minor semantic-preserving transformations, and rely heavily on functional-context shortcuts when the semantic gap is small. These findings provide strong evidence that current fine-tuning practices often fail to impart true vulnerability reasoning. They serve as a wake-up call: high scores on traditional benchmarks may be illusory, masking a model's inability to understand the true causal logic of vulnerabilities.
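To make the idea of a semantic-preserving perturbation concrete, the sketch below shows one illustrative transformation of the kind the abstract describes: renaming identifiers without touching the security-critical logic. The helper `rename_identifiers` and the C snippet are hypothetical examples, not part of TrapEval itself; the paper's actual perturbation suite may differ.

```python
import re

def rename_identifiers(code: str, mapping: dict) -> str:
    """Semantic-preserving perturbation (illustrative): rename identifiers
    per `mapping` while leaving control flow and security logic unchanged."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], code)

# Hypothetical vulnerable snippet: unbounded copy into a fixed-size buffer.
vulnerable = "char buf[8];\nstrcpy(buf, user_input);  /* no bounds check */"
perturbed = rename_identifiers(vulnerable, {"buf": "dst", "user_input": "src"})

# A model that reasons about the root cause (the unbounded strcpy) should
# give the same verdict for `vulnerable` and `perturbed`; a model keying on
# surface tokens like variable names may flip its prediction.
```

A robustness probe in this spirit compares a detector's predictions on each original/perturbed pair; any disagreement indicates reliance on surface patterns rather than vulnerability semantics.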