Process-Supervised Reinforcement Learning for Code Generation

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reinforcement learning (RL) approaches for code generation predominantly rely on outcome-based supervision, whereas process supervision—critical for enhancing robustness and interpretability—remains underexplored due to the prohibitive cost of manual step-by-step annotation. Method: We propose PRLCoder, the first RL framework for code generation that enables scalable process supervision. It leverages a teacher model to perform line-level code mutation and refactoring, and employs automated compilation and execution to validate and label intermediate steps, thereby constructing high-quality process supervision data. We further design a fine-grained process reward model that jointly optimizes generation completeness and functional correctness. Contribution/Results: Experiments across multiple benchmarks demonstrate that PRLCoder significantly outperforms outcome-supervised baselines—especially on complex tasks—establishing, for the first time, systematic empirical evidence that process supervision substantially improves generation robustness, interpretability, and error localization capability.

📝 Abstract
Existing reinforcement learning strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While reinforcement learning based on process supervision has shown great promise in handling multi-step reasoning tasks, its effectiveness in code generation remains largely underexplored and underjustified. The primary obstacle stems from the resource-intensive nature of constructing high-quality process-supervised data, which demands substantial human expertise and computational resources. In response to this challenge, we propose a "statement mutation/refactoring - compile and execution verification" strategy: mutating and refactoring code line-by-line through a teacher model, and utilizing compiler execution results to automatically label each line, resulting in line-by-line process-supervised data, which is pivotal for training a process-supervised reward model. The trained reward model is then integrated into the PRLCoder framework, followed by experimental validation on several benchmarks. Experimental results demonstrate that process-supervised reinforcement learning significantly surpasses methods relying solely on outcome supervision. Notably, in tackling complex code generation tasks, process-supervised reinforcement learning shows a clear advantage, ensuring both the integrity of the code generation process and the correctness of the generation results.
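The labeling strategy in the abstract (mutate or refactor one line, then compile and run the tests to decide that line's label) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the helper names are hypothetical, and Python's built-in `compile()`/`exec()` plus a string substitution stand in for the teacher-model mutation and compiler toolchain described above.

```python
# Illustrative sketch of "mutate a line -> compile and execute -> label it".
# All names here are hypothetical; the paper uses a teacher model to mutate
# lines and a real compiler/test harness to verify them.

def runs_correctly(source: str, test: str) -> bool:
    """True if `source` compiles, executes, and passes the `test` assertions."""
    try:
        compile(source, "<candidate>", "exec")  # compilation check
        env: dict = {}
        exec(source, env)                       # execution check
        exec(test, env)                         # functional check (assertions)
        return True
    except Exception:
        return False

def label_mutated_line(solution: list[str], i: int, mutated_line: str,
                       test: str) -> int:
    """Label the mutated i-th line 1 (valid step: program still passes its
    tests after the substitution), else 0 (invalid step)."""
    candidate = solution.copy()
    candidate[i] = mutated_line
    return 1 if runs_correctly("\n".join(candidate), test) else 0

# Toy example: a behavior-preserving refactor vs. a breaking mutation.
solution = ["def add(a, b):", "    return a + b"]
test = "assert add(2, 3) == 5"
print(label_mutated_line(solution, 1, "    return b + a", test))  # refactor keeps behavior
print(label_mutated_line(solution, 1, "    return a - b", test))  # mutation breaks it
```

Repeating this over every line of a solution yields the line-by-line labeled data the abstract describes, without any manual annotation.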
Problem

Research questions and friction points this paper is trying to address.

Explores process-supervised reinforcement learning for code generation.
Addresses resource-intensive process-supervised data construction challenges.
Proposes a strategy for automatic line-by-line process supervision.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Line-by-line code mutation
Compiler execution verification
Process-supervised reward model
Yufan Ye
Beijing Institute of Technology
Ting Zhang
Beijing Normal University
Wenbin Jiang
Hangzhou Dianzi University
Speech Processing, Speech Enhancement, Speech Recognition
Hua Huang
Beijing Normal University