LLMs in Code Vulnerability Analysis: A Proof of Concept

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the efficiency and accuracy limitations of traditional security analysis methods in the face of increasingly large and complex modern software systems. It presents a systematic evaluation of five state-of-the-art open-source large language models—spanning both code-specific and general-purpose architectures—on tasks including C/C++ vulnerability identification, severity prediction, exploitability assessment, and patch generation. For the first time, it directly compares fine-tuning against zero-shot and few-shot prompting strategies. Results demonstrate that fine-tuning consistently outperforms prompt engineering across all tasks. Code-specific models exhibit clear advantages in low-resource settings for complex tasks, while general-purpose models achieve comparable performance under data-rich conditions. The study also reveals that existing automated metrics for patch quality, such as CodeBLEU and CodeBERTScore, often fail to accurately reflect the functional correctness and security effectiveness of generated repairs.
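To make the compared prompting strategies concrete, here is a hypothetical sketch (not the authors' exact setup) of how a few-shot prompt for binary C/C++ vulnerability classification might be assembled before being sent to an instruction-tuned model. The example snippets, labels, and wording are illustrative assumptions, not taken from the paper.

```python
# Hypothetical few-shot prompt construction for C/C++ vulnerability
# classification -- one of the prompting strategies the study compares
# against fine-tuning. Snippets and labels below are illustrative only.
FEW_SHOT_EXAMPLES = [
    ("strcpy(dst, src);", "vulnerable"),                  # unbounded copy
    ("strncpy(dst, src, sizeof(dst) - 1);", "not vulnerable"),
]

def build_prompt(code: str) -> str:
    """Assemble a few-shot classification prompt from labeled examples."""
    lines = ["Classify each C/C++ snippet as 'vulnerable' or 'not vulnerable'.", ""]
    for snippet, label in FEW_SHOT_EXAMPLES:
        lines += [f"Code: {snippet}", f"Label: {label}", ""]
    # The query snippet goes last; the model is expected to complete the label.
    lines += [f"Code: {code}", "Label:"]
    return "\n".join(lines)

prompt = build_prompt("memcpy(buf, input, input_len);")
print(prompt)
```

In a zero-shot variant, `FEW_SHOT_EXAMPLES` would simply be empty, leaving only the instruction and the query snippet.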

📝 Abstract
Context: Traditional software security analysis methods struggle to keep pace with the scale and complexity of modern codebases, requiring intelligent automation to detect, assess, and remediate vulnerabilities more efficiently and accurately. Objective: This paper explores the incorporation of code-specific and general-purpose Large Language Models (LLMs) to automate critical software security tasks, such as identifying vulnerabilities, predicting severity and access complexity, and generating fixes as a proof of concept. Method: We evaluate five pairs of recent LLMs, including both code-based and general-purpose open-source models, on two recognized C/C++ vulnerability datasets, namely Big-Vul and Vul-Repair. Additionally, we compare fine-tuning and prompt-based approaches. Results: The results show that fine-tuning uniformly outperforms both zero-shot and few-shot approaches across all tasks and models. Notably, code-specialized models excel in zero-shot and few-shot settings on complex tasks, while general-purpose models remain nearly as effective. Discrepancies among CodeBLEU, CodeBERTScore, BLEU, and ChrF highlight the inadequacy of current metrics for measuring repair quality. Conclusions: This study contributes to the software security community by investigating the potential of advanced LLMs to improve vulnerability analysis and remediation.
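The metric discrepancies noted above can be illustrated with a minimal sketch. The toy scorer below is a token-level F1 overlap standing in for n-gram similarity metrics such as BLEU or CodeBLEU; the patches, variable names, and scores are illustrative assumptions, not data from the paper. An insecure patch that differs from the reference fix by a single token (an off-by-one bound check) still scores nearly as high as the correct one, which is exactly why surface-similarity metrics can overstate repair quality.

```python
# Toy surface-similarity scorer (a stand-in for BLEU/CodeBLEU-style
# metrics). Illustrates why high textual overlap with a reference fix
# does not imply a functionally correct or secure patch.
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Token-level F1 overlap between a reference patch and a candidate."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "if ( idx >= buf_len ) return -1 ;"
secure    = "if ( idx >= buf_len ) return -1 ;"  # correct bound check
insecure  = "if ( idx > buf_len ) return -1 ;"   # off-by-one: still exploitable

print(unigram_f1(reference, secure))    # 1.0
print(unigram_f1(reference, insecure))  # ~0.89: scores high despite the bug
```

Real metrics like CodeBLEU add syntax- and dataflow-aware terms, but as the paper argues, they can still reward patches that are textually close yet insecure.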
Problem

Research questions and friction points this paper addresses.

code vulnerability analysis
Large Language Models
software security
vulnerability detection
automated remediation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Vulnerability Analysis
Fine-tuning
Code Repair
Prompt Engineering