🤖 AI Summary
This work investigates why the performance of large language models (LLMs) in vulnerability detection has stagnated. It finds that their predictions rely predominantly on shallow syntactic code metrics, such as cyclomatic complexity and line count, rather than on a deep semantic understanding of program behavior.
Method: We analyze both the correlation and the causal relationship between LLM vulnerability predictions and traditional code metrics, and complement this with controlled experiments that change the value of a metric to quantify how strongly the model's predictions respond.
Contribution/Results: Experiments show that lightweight classifiers trained solely on handcrafted code metrics perform on par with state-of-the-art LLMs, suggesting that current LLM-based detectors operate largely on surface-level statistical patterns. This challenges the implicit assumption that LLMs possess robust, semantics-aware code comprehension. The findings expose a critical limitation of prevailing approaches and motivate recommendations for developing semantics-driven vulnerability detection models.
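To make "handcrafted code metrics" concrete, the following is a minimal sketch of the kind of shallow, syntax-level features such a classifier might consume. It is not the paper's feature set: the function name `code_metrics`, the choice of three metrics, and the regex-based cyclomatic-complexity estimate are all assumptions made for illustration, and the estimate only approximates what a real parser would compute for C-like code.

```python
import re

# Decision-point tokens for a rough cyclomatic-complexity estimate.
# Assumption: C-like source; this is a regex approximation, not a parser,
# so tokens inside strings or comments would be miscounted.
_DECISION = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\||\?")


def code_metrics(source: str) -> dict:
    """Return a few shallow, syntax-level metrics for one function (hypothetical helper)."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    return {
        "loc": len(lines),                                  # non-blank line count
        "cyclomatic": 1 + len(_DECISION.findall(source)),   # decision points + 1
        "max_line_len": max((len(ln) for ln in lines), default=0),
    }


snippet = """
int copy(char *dst, const char *src, int n) {
    if (n > 0 && src) {
        for (int i = 0; i < n; i++) dst[i] = src[i];
    }
    return n;
}
"""

metrics = code_metrics(snippet)
print(metrics)
```

A vector of such values per function could be fed to any off-the-shelf classifier. Note that none of these features captures what the code actually does, which is what makes a comparable result from an LLM notable.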
📝 Abstract
Large language models (LLMs) excel in many software engineering tasks, yet progress in leveraging them for vulnerability discovery has stalled in recent years. To understand this phenomenon, we investigate LLMs through the lens of classic code metrics. Surprisingly, we find that a classifier trained solely on these metrics performs on par with state-of-the-art LLMs for vulnerability discovery. A root-cause analysis reveals a strong correlation and a causal effect between code metrics and LLM predictions: when the value of a metric is changed, LLM predictions tend to shift by a corresponding magnitude. This dependency suggests that LLMs operate at a similarly shallow level as code metrics, limiting their ability to grasp complex patterns and fully realize their potential in vulnerability discovery. Based on these findings, we derive recommendations on how research should more effectively address this challenge.
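The metric-intervention idea in the abstract can be sketched as follows: change one metric while leaving program behavior untouched, then measure how far the detector's score moves. This is only an illustration under stated assumptions, not the authors' procedure. The helpers `pad_with_comments` and `prediction_shift` are hypothetical, and `toy_detector` is a deliberately metric-driven stand-in for a real LLM-based detector, included so the example runs on its own.

```python
from typing import Callable


def pad_with_comments(source: str, extra_lines: int) -> str:
    """Raise the line count of C-like code without changing its behavior."""
    if extra_lines <= 0:
        return source
    padding = "\n".join("// padding" for _ in range(extra_lines))
    return source + "\n" + padding


def prediction_shift(predict: Callable[[str], float], source: str, extra_lines: int) -> float:
    """Score difference caused by intervening on line count alone."""
    return predict(pad_with_comments(source, extra_lines)) - predict(source)


def toy_detector(source: str) -> float:
    # Hypothetical stand-in for an LLM detector (assumption, not a real model):
    # its "vulnerability score" grows with the non-blank line count.
    loc = sum(1 for ln in source.splitlines() if ln.strip())
    return min(1.0, loc / 100)


code = "int f(int x) {\n    return x + 1;\n}"
shift = prediction_shift(toy_detector, code, extra_lines=10)
print(f"score shift after padding: {shift:.2f}")
```

A detector that reasons about semantics should show a shift near zero under such an edit, since the padded function behaves identically. A consistently non-zero shift indicates that the score depends on the metric itself.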