🤖 AI Summary
Existing LLM-based code generation methods rely solely on unit test pass rates as reward signals, ensuring functional correctness while neglecting non-functional attributes such as maintainability, security, and overall code quality.
Method: We propose a multidimensional code quality–driven reinforcement learning framework that integrates a quantifiable evaluation library—covering readability, conciseness, security, and other dimensions—into the Group Relative Policy Optimization (GRPO) framework as structured reward signals. This enables end-to-end integration of quality metrics into RL reward modeling for the first time. We further validate improvements via double-blind expert evaluation.
Contribution/Results: Experiments demonstrate that our approach preserves functional correctness while significantly improving quantitative code quality scores. Double-blind expert assessments consistently confirm enhanced maintainability and security. The method achieves joint optimization of functional and non-functional properties across diverse programming tasks.
📝 Abstract
Large Language Models (LLMs) are gaining widespread use for code generation. Recent training procedures use execution feedback as a reward signal, typically measuring functional correctness via unit test pass rate. However, this signal fails to capture the maintainability, quality, and safety of the code produced. We address this under-explored area by developing a comprehensive library to quantify various aspects of code quality and using it as a reward in GRPO. We find that GRPO increases code quality according to this measure, a result confirmed by blinded expert human annotators.
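The reward design described above can be sketched minimally as follows. This is an illustrative assumption, not the paper's actual evaluation library: `quality_score` is a toy stand-in for the multidimensional quality metrics, `composite_reward` is one hypothetical way to blend pass rate with quality, and `grpo_advantages` shows the group-relative normalization that gives GRPO its name.

```python
def quality_score(code: str) -> float:
    """Toy stand-in for a quality library: rewards short lines and docstrings.

    A real library would cover readability, conciseness, security, etc.
    """
    lines = code.splitlines()
    readability = sum(1 for l in lines if len(l) <= 79) / max(len(lines), 1)
    documented = 1.0 if '"""' in code else 0.0
    return 0.5 * readability + 0.5 * documented


def composite_reward(pass_rate: float, code: str, w_quality: float = 0.3) -> float:
    """Blend functional correctness with quality.

    Multiplying by pass_rate gates the quality bonus on correctness,
    so quality cannot compensate for failing tests (one design choice
    among several; the paper's exact weighting is not specified here).
    """
    return pass_rate * ((1.0 - w_quality) + w_quality * quality_score(code))


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize rewards within a sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

For a group of sampled completions for the same prompt, each completion's composite reward would be computed and then normalized with `grpo_advantages`, so completions are rewarded relative to their siblings rather than on an absolute scale.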