🤖 AI Summary
Large language models for code generation frequently produce insecure code, posing significant risks to software development. This paper systematically evaluates seven parameter-efficient fine-tuning (PEFT) methods for improving the security of code generated by large models, focusing on Python and Java while preserving functional correctness. Prompt tuning proves the most effective PEFT method, lifting the overall secure rate on CodeGen2-16B from a 67.28% baseline to 80.86%, a 13.5-percentage-point gain; the paper further shows that decoding temperature plays a critical role in output security, and optimizing it raises the secure rate to 87.65%. That improvement corresponds to roughly 203,700 fewer vulnerable code snippets per million generated. The paper also combines temperature-scaled sampling with the TrojanPuzzle framework to assess robustness against data-poisoning attacks, demonstrating a scalable, robust, and lightweight optimization paradigm for secure code generation.
📝 Abstract
Code-generating Large Language Models (LLMs) significantly accelerate software development. However, their frequent generation of insecure code presents serious risks. We present a comprehensive evaluation of seven parameter-efficient fine-tuning (PEFT) techniques, demonstrating substantial gains in secure code generation without compromising functionality. Our research identifies prompt-tuning as the most effective PEFT method, achieving an 80.86% Overall-Secure-Rate on CodeGen2-16B, a 13.5-percentage-point improvement over the 67.28% baseline. Optimizing decoding strategies through sampling temperature further elevated security to 87.65%. This equates to a reduction of approximately 203,700 vulnerable code snippets per million generated. Moreover, prompt and prefix tuning increase robustness against poisoning attacks in our TrojanPuzzle evaluation, with strong performance against CWE-79 and CWE-502 attack vectors. Our findings generalize across Python and Java, confirming prompt-tuning's consistent effectiveness. This study provides essential insights and practical guidance for building more resilient software systems with LLMs.
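The paper does not reproduce its decoding code, but the mechanism behind temperature-scaled sampling is standard: logits are divided by a temperature before the softmax, so low temperatures concentrate probability on the top token while high temperatures flatten the distribution. A minimal, self-contained sketch (all names here are illustrative, not from the paper):

```python
import math

def temperature_scaled_probs(logits, temperature):
    """Convert raw logits to sampling probabilities at a given temperature.

    Lower temperatures sharpen the distribution toward the highest-logit
    token (more deterministic generations); higher temperatures flatten
    it (more diverse, exploratory generations).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# The same logits decoded at two temperatures: near-greedy vs. exploratory.
logits = [2.0, 1.0, 0.5]
cool = temperature_scaled_probs(logits, 0.2)  # top token dominates
warm = temperature_scaled_probs(logits, 1.5)  # probability mass spreads out
```

In a study like this one, sweeping the temperature at generation time and measuring the secure rate at each setting is how the security/diversity trade-off would be quantified.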