SecureCodeRL: Security-Aware Reinforcement Learning for Code Generation with Partial-Credit Rewards

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SecureCodeRL, a reinforcement learning framework that jointly optimizes functional correctness and security in code generation by large language models. To address sparse rewards under strict input–output matching, it introduces a partial-credit reward that scores intermediate signals (syntactic validity, successful execution, and producing output), so learning receives guidance before any test passes. Security feedback comes from the static analysis tool Bandit. In a pilot evaluation on a small held-out prompt set from APPS+, comparing supervised fine-tuning (SFT) with proximal policy optimization (PPO) variants, PPO with partial credit raises syntactic validity from 45% (SFT) to 60%, achieves the only non-zero test success signal in the evaluation (5% of prompts pass at least one test), and produces code that remains 100% clean under Bandit static analysis.

📝 Abstract
Large Language Models (LLMs) can generate plausible code, but in settings that require exact stdin/stdout behavior they frequently produce programs that compile yet fail tests, and in some cases they introduce security-sensitive patterns. This paper presents SecureCodeRL, a reinforcement learning (RL) pipeline for security-aware code generation that optimizes a combined reward $R = \alpha R_{\text{func}} + \beta R_{\text{sec}}$. The key idea is a partial-credit functional reward that assigns intermediate scores for syntactic validity, successful execution, and producing output, reducing reward sparsity that otherwise stalls learning on competitive programming style tasks. I evaluate supervised fine-tuning (SFT) and PPO variants on a small held-out prompt set from APPS+ and observe that PPO with partial credit (using a continued-training variant) improves syntax validity from 45% (SFT) to 60% and achieves the only non-zero test success signal in this pilot evaluation (5% at-least-one-test-pass), while remaining 100% clean under Bandit static analysis. Although Bandit findings were absent in this small evaluation, the security term is integrated into training to discourage insecure shortcuts when they appear.
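
The partial-credit idea and the combined reward described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the tier weights, the default values of alpha and beta, and the function names `functional_reward` and `combined_reward` are hypothetical, and the security score `r_sec` is assumed to be computed elsewhere (for example, 1.0 when Bandit reports no findings).

```python
import ast
import subprocess
import sys

# Hypothetical tier weights (not taken from the paper); they sum to 1.0.
SYNTAX_CREDIT = 0.25   # code parses
EXEC_CREDIT = 0.125    # at least one run exits with status 0
OUTPUT_CREDIT = 0.125  # at least one clean run prints something
PASS_CREDIT = 0.5      # scaled by the fraction of tests passed


def functional_reward(code, tests, timeout=5.0):
    """Score a candidate program in [0, 1], giving credit for partial progress.

    `tests` is a list of (stdin_text, expected_stdout) pairs.
    """
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return 0.0  # unparseable code earns nothing
    score = SYNTAX_CREDIT
    if not tests:
        return score

    ran_ok = produced_output = False
    passed = 0
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            continue  # a timed-out run earns no execution credit
        if proc.returncode != 0:
            continue
        ran_ok = True
        if proc.stdout.strip():
            produced_output = True
        if proc.stdout.strip() == expected.strip():
            passed += 1

    if ran_ok:
        score += EXEC_CREDIT
    if produced_output:
        score += OUTPUT_CREDIT
    score += PASS_CREDIT * passed / len(tests)
    return score


def combined_reward(r_func, r_sec, alpha=0.5, beta=0.5):
    """R = alpha * R_func + beta * R_sec; the default weights are assumptions."""
    return alpha * r_func + beta * r_sec
```

With these illustrative weights, a program that parses but crashes still earns 0.25 instead of 0, which is the kind of intermediate signal the abstract credits with reducing reward sparsity.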
Problem

Research questions and friction points this paper is trying to address.

code generation
security vulnerabilities
reward sparsity
large language models
functional correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

partial-credit reward
security-aware code generation
reinforcement learning for code
combined functional-security reward
LLM fine-tuning with PPO
Authors
Suryansh Singh Sijwali
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
Suman Saha
Pennsylvania State University