A Comprehensive Study of LLM Secure Code Generation

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two weaknesses in how secure code generation methods are currently evaluated: security and functional correctness are assessed separately on different datasets, and vulnerability detection relies heavily on a single static analyzer (typically CodeQL), which biases results. It is the first to run multi-dimensional security detection and functional correctness verification simultaneously on the same benchmark. Methodologically, it integrates three complementary static analyzers (CodeQL, Semgrep, and SonarQube) and introduces a dual-LLM collaborative verification mechanism within a joint security-functionality evaluation framework. Key findings: mainstream approaches often achieve "illusory security" at the expense of functionality, exhibiting failure modes such as line-deletion repairs and garbage-code generation; CodeQL significantly overestimates security gains while missing diverse real-world vulnerabilities; and most methods fail to improve security and functionality jointly, with some even underperforming the baseline LLMs. The study establishes a more rigorous, reproducible, and holistic evaluation paradigm for secure code generation.
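As a rough illustration of the cross-checking idea described above (not the authors' actual harness; every detector below is a hypothetical stand-in for a real tool wrapper), a Python sketch: static-analysis findings are pooled across tools, and an LLM-reported vulnerability counts only when two independent judges agree.

```python
# Illustrative sketch of the cross-checking idea, NOT the paper's harness.
# Every detector here is a hypothetical stand-in for a real tool wrapper.
from typing import Callable

Detector = Callable[[str], bool]  # takes source code, returns "flagged?"


def analyzer_findings(code: str, analyzers: dict[str, Detector]) -> set[str]:
    """Names of the static analyzers (e.g., wrappers around CodeQL, Semgrep,
    SonarQube) that report at least one finding for this code sample."""
    return {name for name, scan in analyzers.items() if scan(code)}


def dual_llm_flag(code: str, judge_a: Detector, judge_b: Detector) -> bool:
    """Count an LLM-reported vulnerability only when both judges agree,
    filtering out single-model false positives (assumed intent of the
    dual-LLM mechanism described in the summary)."""
    return judge_a(code) and judge_b(code)


if __name__ == "__main__":
    # Trivial pattern-matching stand-ins, for demonstration only.
    demo_analyzers: dict[str, Detector] = {
        "codeql": lambda code: "strcpy" in code,
        "semgrep": lambda code: "os.system" in code,
        "sonarqube": lambda code: "eval(" in code,
    }
    sample = 'os.system("rm -rf " + user_input)'
    print(analyzer_findings(sample, demo_analyzers))              # {'semgrep'}
    print(dual_llm_flag(sample, lambda c: True, lambda c: True))  # True
```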

📝 Abstract
LLMs are widely used in software development. However, the code generated by LLMs often contains vulnerabilities. Several secure code generation methods have been proposed to address this issue, but their current evaluation schemes leave several concerns unaddressed. Specifically, most existing studies evaluate security and functional correctness separately, using different datasets. That is, they assess vulnerabilities using security-related code datasets while validating functionality with general code datasets. In addition, prior research primarily relies on a single static analyzer, CodeQL, to detect vulnerabilities in generated code, which limits the scope of security evaluation. In this work, we conduct a comprehensive study to systematically assess the improvements introduced by four state-of-the-art secure code generation techniques. Specifically, we apply both security inspection and functionality validation to the same generated code and evaluate these two aspects together. We also employ three popular static analyzers and two LLMs to identify potential vulnerabilities in the generated code. Our study reveals that existing techniques often compromise the functionality of generated code to enhance security. Their overall performance remains limited when evaluating security and functionality together. In fact, many techniques even degrade the performance of the base LLM. Our further inspection reveals that these techniques often either remove vulnerable lines of code entirely or generate "garbage code" that is unrelated to the intended task. Moreover, the commonly used static analyzer CodeQL fails to detect several vulnerabilities, further obscuring the actual security improvements achieved by existing techniques. Our study serves as a guideline for a more rigorous and comprehensive evaluation of secure code generation performance in future work.
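The abstract's central methodological point is that security and functionality must be verified on the same generated sample. A minimal sketch of what such a joint metric can look like (illustrative names only, not the paper's code): a sample counts only if it both passes its functional tests and draws no vulnerability findings.

```python
# Minimal sketch of a joint security-functionality metric (illustrative
# names, not the authors' code): a sample counts only if it is BOTH
# functionally correct and free of vulnerability findings.
from dataclasses import dataclass


@dataclass
class SampleResult:
    passes_tests: bool        # functional correctness on the task's test suite
    vulnerability_count: int  # findings pooled from all detectors


def joint_pass_rate(results: list[SampleResult]) -> float:
    """Fraction of samples that are secure AND functional at once. Scoring
    the two separately on different datasets, the practice the paper
    criticizes, lets line-deletion 'repairs' and garbage code look secure
    while hiding the functionality they destroyed."""
    good = sum(1 for r in results if r.passes_tests and r.vulnerability_count == 0)
    return good / len(results) if results else 0.0


print(joint_pass_rate([
    SampleResult(passes_tests=True, vulnerability_count=0),   # genuinely good
    SampleResult(passes_tests=False, vulnerability_count=0),  # illusory security
    SampleResult(passes_tests=True, vulnerability_count=2),   # insecure
]))  # ~0.33
```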
Problem

Research questions and friction points this paper is trying to address.

Evaluate security and functionality of LLM-generated code together
Assess vulnerabilities using multiple static analyzers and LLMs
Identify limitations of existing secure code generation techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates security and functionality jointly on the same generated code
Uses multiple static analyzers for vulnerability detection
Evaluates four secure code generation techniques comprehensively
Shih-Chieh Dai
University of Utah
Jun Xu
University of Utah
Guanhong Tao
Assistant Professor, University of Utah
Machine Learning · Computer Security