🤖 AI Summary
Existing reinforcement learning–based code generation methods rely solely on outcome rewards from test cases, leaving the quality of intermediate reasoning unsupervised; naively rewarding the reasoning process instead invites reward hacking, where the policy inflates its reasoning score without improving final outputs.
Method: We propose Posterior-GRPO (P-GRPO), a novel framework that restricts process rewards to the reasoning traces of task instances whose final output succeeds, thereby mitigating reward hacking at its root. To score reasoning quality, the authors build LCB-RB, a benchmark of preference pairs over superior and inferior reasoning processes, and train a 7B-parameter reward model with an Optimized-Degraded (OD-based) strategy that constructs high-quality preference data by systematically optimizing and degrading initial reasoning paths along curated quality dimensions. This reward model is then deployed inside the conditional RL loop for end-to-end reasoning alignment (a sketch of the reward gating follows this summary).
Results: Experiments show significant gains over outcome-only reward baselines (+4.5% on code generation), performance comparable to GPT-4-Turbo, and strong generalization to mathematical reasoning tasks. The models, dataset, and code are publicly released.
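To make the gating concrete, here is a minimal sketch of how process rewards could be conditioned on task success before computing GRPO-style group-normalized advantages. The combination rule, the weight `beta`, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def p_grpo_rewards(outcome_rewards: torch.Tensor,
                   process_rewards: torch.Tensor,
                   beta: float = 0.5) -> torch.Tensor:
    """Gate process rewards on final task success (illustrative sketch).

    outcome_rewards: 1.0 if all test cases pass for a rollout, else 0.0.
    process_rewards: reasoning-quality scores from the reward model, in [0, 1].
    beta: weight on the process term (assumed here; not from the paper).
    """
    # Process rewards count only for rollouts whose code actually passes,
    # so the policy cannot farm reasoning scores while failing the tests.
    success_mask = (outcome_rewards > 0).float()
    return outcome_rewards + beta * success_mask * process_rewards

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard GRPO step: normalize rewards within a group of rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts sampled for one prompt.
outcome = torch.tensor([1.0, 0.0, 1.0, 0.0])
process = torch.tensor([0.9, 0.8, 0.4, 0.2])  # rollout 2 "reasons well" but fails
advantages = grpo_advantages(p_grpo_rewards(outcome, process))
```

Note how the second rollout's high reasoning score is ignored because its code fails the tests; under an unconditioned scheme it would still be rewarded, which is exactly the hacking channel P-GRPO closes.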
📝 Abstract
Reinforcement learning (RL) has significantly advanced code generation for large language models (LLMs). However, current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. While supervising the reasoning process directly is a promising direction, it is highly susceptible to reward hacking, where the policy model learns to exploit the reasoning reward signal without improving final outcomes. To address this, we introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL. First, to enable reasoning evaluation, we develop LCB-RB, a benchmark comprising preference pairs of superior and inferior reasoning processes. Second, to accurately score reasoning quality, we introduce an Optimized-Degraded (OD-based) method for reward model training. This method generates high-quality preference pairs by systematically optimizing and degrading initial reasoning paths along curated dimensions of reasoning quality, such as factual accuracy, logical rigor, and coherence. A 7B-parameter reward model trained with this method achieves state-of-the-art (SOTA) performance on LCB-RB and generalizes well to other benchmarks. Finally, we introduce Posterior-GRPO (P-GRPO), a novel RL method that conditions process-based rewards on task success. By selectively applying rewards to the reasoning processes of only successful outcomes, P-GRPO effectively mitigates reward hacking and aligns the model's internal reasoning with final code correctness. A 7B-parameter model trained with P-GRPO achieves superior performance across diverse code generation tasks, outperforming outcome-only baselines by 4.5% and achieving performance comparable to GPT-4-Turbo. We further demonstrate the generalizability of our approach by extending it to mathematical tasks. Our models, dataset, and code are publicly available.
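The OD-based data generation can likewise be sketched in a few lines. Assuming a generic `llm` callable and hypothetical prompt wording (neither is specified in the abstract), each initial reasoning path is rewritten twice per quality dimension, once optimized and once degraded, yielding a chosen/rejected pair for reward model training.

```python
from dataclasses import dataclass

# Curated quality dimensions named in the paper; the prompt text below is assumed.
DIMENSIONS = ["factual accuracy", "logical rigor", "coherence"]

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # optimized reasoning path
    rejected: str  # degraded reasoning path

def build_pair(llm, task_prompt: str, initial_reasoning: str,
               dimension: str) -> PreferencePair:
    """Create one preference pair by rewriting the same reasoning path twice."""
    improved = llm(f"Improve the {dimension} of this reasoning for the task "
                   f"'{task_prompt}', keeping the final answer intact:\n"
                   f"{initial_reasoning}")
    degraded = llm(f"Subtly worsen the {dimension} of this reasoning for the "
                   f"task '{task_prompt}', keeping it superficially plausible:\n"
                   f"{initial_reasoning}")
    return PreferencePair(prompt=task_prompt, chosen=improved, rejected=degraded)

def build_dataset(llm, tasks):
    """tasks: iterable of (task_prompt, initial_reasoning) tuples."""
    return [build_pair(llm, p, r, d)
            for p, r in tasks
            for d in DIMENSIONS]
```

Pairing an optimized and a degraded rewrite of the *same* initial path keeps the two sides topically matched, so the reward model is pushed to discriminate on reasoning quality rather than on surface differences between unrelated solutions.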