An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the prevalence of security vulnerabilities, such as weak cryptography and insufficient input validation, in code generated by large language models (LLMs), noting that existing prompting strategies offer limited security improvements. The authors systematically evaluate the security of code generated by five LLMs across Java, C++, C, and Python using multiple prompting techniques, and propose a Weakness-Aware Zero-shot Chain-of-Thought (WA-0CoT) prompting method that uses Common Weakness Enumeration (CWE) mappings to guide models toward security-relevant reasoning. Chi-square tests show no statistically significant reduction in vulnerability frequency or density for any prompting approach, including WA-0CoT; the approaches do, however, systematically shift the distribution of CWE weakness categories. This effect varies by programming language, which underscores the need to tailor security-aware prompts to both language-specific constructs and model characteristics.
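To make the idea of a weakness-aware zero-shot chain-of-thought prompt concrete, the following is a minimal sketch of how such a prompt might be assembled. It is not the authors' template, which this summary does not reproduce: the `CWE_HINTS` mapping, the topic keys, the function name `build_wa_0cot_prompt`, and the prompt wording are all hypothetical. Only the CWE identifiers refer to real entries, listed here with shortened names.

```python
# Hypothetical topic -> CWE mapping. The paper's actual CWE mapping is not
# given in this summary; the IDs are real CWE entries with shortened names.
CWE_HINTS = {
    "crypto": [("CWE-327", "Use of a Broken or Risky Cryptographic Algorithm")],
    "input": [("CWE-20", "Improper Input Validation")],
    "sql": [("CWE-89", "SQL Injection")],
}


def build_wa_0cot_prompt(task: str, language: str, topics: list[str]) -> str:
    """Assemble an illustrative weakness-aware zero-shot CoT prompt.

    The wording is an assumption made for illustration, not the paper's prompt.
    """
    # Collect the CWE entries for every topic the task touches.
    weaknesses = [entry for topic in topics for entry in CWE_HINTS.get(topic, [])]

    lines = [f"Write {language} code for the following task: {task}"]
    if weaknesses:
        # Security context: name the weaknesses the model should reason about.
        lines.append("Weaknesses to avoid:")
        lines.extend(f"- {cwe_id}: {name}" for cwe_id, name in weaknesses)
    # Zero-shot chain-of-thought trigger: no worked examples are supplied.
    lines.append(
        "Let's think step by step about how each weakness could arise, "
        "then write the code."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_wa_0cot_prompt("store a user password", "Python", ["crypto", "input"]))
```

The sketch keeps the prompt zero-shot (no example solutions) and adds only the weakness list as extra context, which matches the description above at a high level; the real method may select and phrase the CWE context differently.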

📝 Abstract

The growing use of Large Language Models (LLMs) for automated code generation has enhanced software development efficiency, but often at the cost of security. Generated code frequently overlooks critical concerns, leaving it vulnerable to issues such as weak encryption and improper input validation. To investigate this problem, we present a comprehensive empirical evaluation of the security quality of LLM-generated code across five LLMs and four programming languages (Java, C++, C, and Python), examining the impact of multiple prompt engineering methods. We introduce a weaknesses-aware zero-shot chain-of-thought (WA-0CoT) prompting strategy that enriches prompts with security context using CWE mappings to guide model reasoning. Our empirical analysis, supported by chi-square tests, finds no statistically significant reductions in vulnerability frequency or density across prompt methods. However, prompting strategies, including WA-0CoT, systematically influence the compositional distribution of CWE categories, with effects varying by programming language. These findings suggest that while security-aware prompting alters the structure of generated weaknesses, prompt engineering alone is insufficient to reliably reduce overall vulnerability levels. The results highlight the importance of language-aware and model-aware prompt design when evaluating the security properties of LLM-generated code.
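The chi-square comparison mentioned in the abstract can be illustrated with a small sketch. It computes the Pearson chi-square statistic and degrees of freedom for a contingency table whose rows are prompting methods and whose columns are CWE categories. The counts in `example_counts` are placeholders invented for illustration and are not results from the paper; the function name `chi_square_statistic` is likewise hypothetical, and the paper's exact test setup is not described here.

```python
def chi_square_statistic(table: list[list[float]]) -> tuple[float, int]:
    """Return (Pearson chi-square statistic, degrees of freedom).

    `table` is a rows x columns grid of observed counts. Every row total and
    column total must be non-zero, otherwise an expected count would be zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


if __name__ == "__main__":
    # Placeholder counts (NOT from the paper): rows are two prompting methods,
    # columns are three CWE categories.
    example_counts = [
        [12, 30, 8],
        [25, 14, 11],
    ]
    stat, dof = chi_square_statistic(example_counts)
    print(f"chi-square = {stat:.3f}, dof = {dof}")
```

A large statistic relative to the chi-square distribution with `dof` degrees of freedom would indicate that the mix of CWE categories differs between prompting methods, even when total vulnerability counts are similar. If SciPy is available, `scipy.stats.chi2_contingency` performs the same computation and also returns a p-value.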

Problem

Research questions and friction points this paper is trying to address.

LLM-generated code

code security

prompting methods

vulnerability

CWE

Innovation

Methods, ideas, or system contributions that make the work stand out.

WA-0CoT

prompt engineering

LLM-generated code