SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the lack of comprehensive evaluation of structural and semantic reasoning capabilities of large language models (LLMs) in C source code vulnerability analysis. To this end, we introduce the first dual-dimensional trustworthiness benchmark, explicitly decoupling structural reasoning—focused on data/control-flow identification—and semantic reasoning—centered on logical consistency verification. Our methodology features a controllable adversarial code perturbation mechanism and a fine-grained, interpretable evaluation framework integrating static program analysis, control- and data-flow graph (CFG/DFG) modeling, multi-level human annotation, and response consistency quantification. Experimental results across mainstream LLMs reveal sub-58% accuracy on both reasoning tasks, exposing their overreliance on superficial pattern matching rather than deep, principled reasoning. The benchmark is publicly released to advance trustworthy code-aware AI research, establishing a new methodological paradigm for rigorous, capability-specific LLM evaluation in software security.

📝 Abstract
As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce SV-TrustEval-C, a benchmark designed to evaluate LLMs' abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning - assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning - examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TrustEval-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is publicly available.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' reliability in source code vulnerability analysis
Assessing structure and semantic reasoning in code vulnerability tasks
Improving LLMs' logical reasoning for trustworthy vulnerability detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs' structure and semantic reasoning for code vulnerabilities
Introduces SV-TrustEval-C benchmark for C language vulnerability analysis
Assesses code relationships and logical consistency under perturbations
Yansong Li
University of Ottawa
Paula Branco
University of Ottawa
Alexander M. Hoole
OpenText
Manish Marwah
HP Labs
data science, applied machine learning, cybersecurity, computational sustainability
H. M. Koduvely
OpenText
Guy-Vincent Jourdan
University of Ottawa
Stephan Jou
OpenText