🤖 AI Summary
This work proposes SecureCodeRL, a reinforcement learning framework that jointly optimizes functional correctness and security in code generation by large language models. To address the sparse rewards that arise under strict input/output matching, it introduces a partial-credit reward that assigns intermediate scores for syntactic validity, successful execution, and producing output. Security feedback comes from the static analysis tool Bandit. In a pilot evaluation on a small held-out prompt set from APPS+, comparing supervised fine-tuning (SFT) with proximal policy optimization (PPO) variants, PPO with partial credit improves syntactic validity from 45% (SFT) to 60%, reaches a 5% at-least-one-test-pass rate (the only non-zero test success signal in the evaluation), and produces code that remains 100% clean under Bandit static analysis.
📝 Abstract
Large Language Models (LLMs) can generate plausible code, but in settings that require exact stdin/stdout behavior they frequently produce programs that compile yet fail tests, and in some cases they introduce security-sensitive patterns. This paper presents SecureCodeRL, a reinforcement learning (RL) pipeline for security-aware code generation that optimizes a combined reward $R = \alpha R_{\text{func}} + \beta R_{\text{sec}}$. The key idea is a partial-credit functional reward that assigns intermediate scores for syntactic validity, successful execution, and producing output, reducing reward sparsity that otherwise stalls learning on competitive programming style tasks. I evaluate supervised fine-tuning (SFT) and PPO variants on a small held-out prompt set from APPS+ and observe that PPO with partial credit (using a continued-training variant) improves syntax validity from 45% (SFT) to 60% and achieves the only non-zero test success signal in this pilot evaluation (5% at-least-one-test-pass), while remaining 100% clean under Bandit static analysis. Although Bandit findings were absent in this small evaluation, the security term is integrated into training to discourage insecure shortcuts when they appear.
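To make the reward structure concrete, the following is a minimal sketch of how a partial-credit functional reward and the combined reward could be computed. It is not the paper's implementation. The credit levels (`SYNTAX_CREDIT`, `RUN_CREDIT`, `OUTPUT_CREDIT`), the weights `ALPHA` and `BETA`, and the function names are hypothetical, since the abstract does not state them. The security score is assumed to be computed elsewhere (for example, 1.0 when Bandit reports no findings) and is passed in as a number.

```python
import ast
import subprocess
import sys

# Hypothetical credit levels and weights; the abstract does not specify them.
SYNTAX_CREDIT = 0.2   # code parses
RUN_CREDIT = 0.2      # code exits with status 0
OUTPUT_CREDIT = 0.1   # code writes something to stdout
ALPHA, BETA = 0.7, 0.3


def functional_reward(code: str, tests: list[tuple[str, str]],
                      timeout: float = 2.0) -> float:
    """Average per-test reward in [0, 1], with partial credit for progress."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return 0.0  # no credit for code that does not parse
    if not tests:
        return SYNTAX_CREDIT

    total = 0.0
    for stdin_text, expected in tests:
        score = SYNTAX_CREDIT
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_text, capture_output=True,
                text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            total += score  # syntax credit only when the run times out
            continue
        if proc.returncode == 0:
            score += RUN_CREDIT
            if proc.stdout.strip():
                score += OUTPUT_CREDIT
            if proc.stdout.strip() == expected.strip():
                score = 1.0  # full credit for an exact output match
        total += score
    return total / len(tests)


def combined_reward(r_func: float, r_sec: float,
                    alpha: float = ALPHA, beta: float = BETA) -> float:
    """R = alpha * R_func + beta * R_sec, with r_sec supplied by the caller."""
    return alpha * r_func + beta * r_sec
```

In this sketch, a program that parses and runs but prints the wrong answer still earns a non-zero reward, which is the mechanism the abstract credits with reducing reward sparsity. Running untrusted generated code this way is unsafe outside a sandbox, so a real pipeline would need isolation that this example omits.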