An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation

2024-08-17 · arXiv.org
Citations: 7 (2 influential)

AI Summary
Large language models (LLMs) pretrained on open-source code, which often contains vulnerabilities, tend to generate insecure code. Method: We apply parameter-efficient fine-tuning (PEFT) with LoRA and IA3 to two pretrained code LLMs, using a curated fine-tuning dataset of 14,622 C and C++ files collected from confirmed vulnerability-fix commits. We compare four ways of organizing fine-tuning samples (file, function, block, and line level) and evaluate on a benchmark of 52 prompt scenarios covering the most dangerous C and C++ CWEs. Results: Fine-tuning improves the secure code generation rate by 6.4% for C and 5.4% for C++, with function-level and block-level datasets performing best. This suggests PEFT on fine-grained vulnerability-fix data is a reproducible, cost-effective path toward security-aware code generation.

๐Ÿ“ Abstract
AI-powered coding assistants such as GitHub Copilot and OpenAI ChatGPT have achieved notable success in automating code generation. However, these tools rely on pre-trained Large Language Models (LLMs) that are typically trained on human-written code sourced from open-source project hosting sites like GitHub, which often contains inherent security vulnerabilities. These vulnerabilities may then be mirrored in the code generated by these LLMs, a critical risk revealed and highlighted by recent empirical studies. In this work, we present an exploratory study on whether fine-tuning pre-trained LLMs on datasets of vulnerability-fixing commits can promote secure code generation. We explored two parameter-efficient fine-tuning techniques (LoRA and IA3) on two pre-trained LLMs for code generation. We crawled a fine-tuning dataset (14,622 C and C++ files) for secure code generation by collecting code fixes of confirmed vulnerabilities from open-source repositories. Our evaluation dataset comprises 52 vulnerability scenarios designed to cover the most dangerous C and C++ Common Weakness Enumerations (CWEs). Each scenario is a prompt that may induce LLMs to generate vulnerable code. Our exploration reveals that fine-tuning LLMs can improve secure code generation by 6.4% in C and 5.4% in C++. We further experimented with fine-tuning LLMs using different granularities of the collected secure code dataset (block, function, and line). We found that fine-tuning with function-level and block-level datasets achieves the best secure code generation performance, compared to the alternatives (file-level and line-level).
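As a concrete illustration of one of the two PEFT techniques named in the abstract: LoRA freezes the pretrained weight matrix W and learns only a low-rank update (alpha / r) · B @ A on top of it. The sketch below is an assumption-based, pure-Python illustration of that idea, not the paper's implementation or the Hugging Face PEFT library.

```python
# Minimal sketch of the LoRA idea (illustrative, not the paper's code):
# the frozen weight W (d_out x d_in) is augmented by a low-rank update
# (alpha / r) * B @ A, where B is d_out x r and A is r x d_in.

def matmul(X, Y):
    """Plain-Python matrix multiply, sufficient for this sketch."""
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """Apply the effective weight (W + (alpha/r) * B @ A) to vector x."""
    BA = matmul(B, A)
    scale = alpha / r
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, BA)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_eff]

# Tiny example: d_out = d_in = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (identity here)
A = [[1.0, 1.0]]               # trainable, shape r x d_in
B = [[0.5], [0.5]]             # trainable, shape d_out x r
y = lora_forward([1.0, 2.0], W, A, B, alpha=1.0, r=1)
print(y)  # effective weight is W + 0.5 * [[1, 1], [1, 1]] -> [2.5, 3.5]
```

Only A and B (2·d·r parameters per layer at rank r) receive gradient updates, which is why LoRA fine-tuning is far cheaper than updating all model weights.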
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning LLMs to generate secure code by fixing vulnerabilities
Addressing security risks in AI-generated code from training data
Improving code safety through vulnerability-fixing commit datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning LLMs with vulnerability-fixing commit datasets
Using parameter-efficient techniques like LoRA and IA3
Optimizing secure code generation with function-level datasets
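For contrast with LoRA, the IA3 technique listed above takes an even lighter approach: rather than learning low-rank weight updates, it learns small per-dimension vectors that rescale selected hidden activations element-wise, while all original weights stay frozen. The sketch below is an assumption-based illustration of that core operation, not the paper's implementation.

```python
# Minimal sketch of the IA3 idea (illustrative, not the paper's code):
# IA3 learns one scaling vector per targeted activation (e.g. attention
# keys/values, feed-forward activations) and applies it element-wise,
# so the trainable parameter count is tiny (one scalar per dimension).

def ia3_scale(h, l):
    """Element-wise rescaling of activation vector h by learned vector l."""
    return [hi * li for hi, li in zip(h, l)]

h = [2.0, -1.0, 0.5]   # a hidden activation from the frozen model
l = [1.0, 0.5, 2.0]    # learned IA3 vector (typically initialized to 1.0)
print(ia3_scale(h, l))  # -> [2.0, -0.5, 1.0]
```

Because the vector is initialized to all ones, fine-tuning starts from the frozen model's exact behavior and only nudges activations where the task demands it.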
Junjie Li
Concordia University
Fazle Rabbi
Concordia University
Cheng Cheng
Concordia University
Aseem Sangalay
Delhi Technological University
Yuan Tian
Queen's University
Jinqiu Yang
Concordia University
Automated Program Repair · Text Analytics of Software Artifacts · Mining Software Repositories · Software Engineering