LLM-based Vulnerability Discovery through the Lens of Code Metrics

📅 2025-09-23

🤖 AI Summary

This work investigates why the vulnerability detection performance of large language models (LLMs) has stagnated. It finds that LLM predictions depend heavily on shallow code metrics, such as cyclomatic complexity and line count, rather than on a deeper semantic understanding of program behavior. Method: The authors compare LLMs against a classifier trained solely on classic code metrics, then run a root-cause analysis that tests both the correlation and the causal effect between metric values and LLM predictions by changing individual metrics and observing how the predictions shift. Contribution/Results: The metrics-only classifier performs on par with state-of-the-art LLMs, which suggests that current LLM-based detectors operate at a similarly shallow level as code metrics. This challenges the implicit assumption that LLMs possess robust, semantics-aware code comprehension. Based on these findings, the authors derive recommendations for how research should address this limitation.

📝 Abstract

Large language models (LLMs) excel in many tasks of software engineering, yet progress in leveraging them for vulnerability discovery has stalled in recent years. To understand this phenomenon, we investigate LLMs through the lens of classic code metrics. Surprisingly, we find that a classifier trained solely on these metrics performs on par with state-of-the-art LLMs for vulnerability discovery. A root-cause analysis reveals a strong correlation and a causal effect between LLMs and code metrics: When the value of a metric is changed, LLM predictions tend to shift by a corresponding magnitude. This dependency suggests that LLMs operate at a similarly shallow level as code metrics, limiting their ability to grasp complex patterns and fully realize their potential in vulnerability discovery. Based on these findings, we derive recommendations on how research should more effectively address this challenge.
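To make "classic code metrics" concrete, here is a minimal sketch of how two such metrics could be computed from the text of a C-like function. The function names, the regex-based counting, and the sample snippet are illustrative assumptions, not the paper's tooling; the regex does not skip comments or string literals, so a parser-based analyzer would be more accurate.

```python
import re

# Tokens treated as decision points. This is a simplifying assumption:
# the regex also matches inside comments and string literals.
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\||\?")


def line_count(source: str) -> int:
    """Count the non-blank lines of a code snippet."""
    return sum(1 for line in source.splitlines() if line.strip())


def approx_cyclomatic_complexity(source: str) -> int:
    """Crude estimate: one plus the number of decision points found."""
    return 1 + len(DECISION_PATTERN.findall(source))


def metric_vector(source: str) -> dict:
    """Bundle the metrics into a feature vector for a downstream classifier."""
    return {
        "line_count": line_count(source),
        "cyclomatic_complexity": approx_cyclomatic_complexity(source),
    }


# Hypothetical sample function, used only to exercise the metrics.
SAMPLE = """
int copy(char *dst, const char *src, int n) {
    if (n <= 0 || dst == 0) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];
    }
    return 0;
}
"""

if __name__ == "__main__":
    print(metric_vector(SAMPLE))
```

Feature vectors of this kind, computed over a labeled dataset, could then be passed to any standard classifier; which metrics and which classifier the authors actually used is not stated in this summary.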

Problem

Research questions and friction points this paper is trying to address.

Investigating why LLMs underperform in vulnerability discovery compared to other tasks

Analyzing the correlation between LLM predictions and shallow code metrics

Proposing research directions to improve LLMs' grasp of complex vulnerability patterns

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed LLMs using classic code metrics

Found metrics-based classifier matches LLM performance

Revealed causal link between metrics and LLM predictions
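The idea of changing a metric and watching the prediction move can be sketched as a small probe. This is an assumption-laden illustration rather than the authors' procedure: the padding strategy (prepending comment lines to raise the line count without changing behavior), the function names, and the `toy_detector` stand-in are all hypothetical, and a real experiment would plug in an actual LLM-based detector.

```python
from typing import Callable

# A detector maps source code to a vulnerability score in [0, 1].
Detector = Callable[[str], float]


def pad_with_comments(source: str, extra_lines: int) -> str:
    """Return a behavior-preserving variant with extra comment lines on top."""
    padding = "".join("// padding\n" for _ in range(extra_lines))
    return padding + source


def prediction_shift(detector: Detector, source: str, extra_lines: int) -> float:
    """Change in the detector's score caused only by the line-count change."""
    baseline = detector(source)
    perturbed = detector(pad_with_comments(source, extra_lines))
    return perturbed - baseline


def toy_detector(source: str) -> float:
    """Stand-in for a real model (hypothetical): score grows with line count."""
    lines = len(source.splitlines())
    return min(1.0, lines / 100.0)


if __name__ == "__main__":
    code = "int f(int x) {\n    return x + 1;\n}\n"
    for extra in (0, 10, 50):
        shift = prediction_shift(toy_detector, code, extra)
        print(f"extra_lines={extra:3d} shift={shift:+.2f}")
```

A detector that reasons about semantics should show a shift near zero under this perturbation, whereas a detector that tracks the metric will shift roughly in step with it.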

🔎 Similar Papers

Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG