🤖 AI Summary
Existing reinforcement learning–based code generation methods rely solely on outcome rewards from test cases, leaving the quality of intermediate reasoning unsupervised; naively rewarding the reasoning process instead invites reward hacking, where the policy inflates its reasoning score without improving final outputs.
Method: We propose Posterior-GRPO (P-GRPO), a novel framework that restricts process rewards to the reasoning traces of task instances whose final output succeeds, thereby mitigating reward hacking at its root. To score reasoning quality, the authors build LCB-RB, a benchmark of preference pairs over superior and inferior reasoning processes, and train a 7B-parameter reward model with an Optimized-Degraded (OD-based) strategy that constructs high-quality preference data by systematically optimizing and degrading initial reasoning paths along curated quality dimensions. This reward model is then deployed inside the conditional RL loop for end-to-end reasoning alignment (a sketch of the reward gating follows this summary).
Results: Experiments show significant gains over outcome-only reward baselines (+4.5% on code generation), performance comparable to GPT-4-Turbo, and strong generalization to mathematical reasoning tasks. The models, dataset, and code are publicly released.
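To make the gating concrete, here is a minimal sketch of how process rewards could be conditioned on task success before computing GRPO-style group-normalized advantages. The combination rule, the weight `beta`, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def p_grpo_rewards(outcome_rewards: torch.Tensor,
                   process_rewards: torch.Tensor,
                   beta: float = 0.5) -> torch.Tensor:
    """Gate process rewards on final task success (illustrative sketch).

    outcome_rewards: 1.0 if all test cases pass for a rollout, else 0.0.
    process_rewards: reasoning-quality scores from the reward model, in [0, 1].
    beta: weight on the process term (assumed here; not from the paper).
    """
    # Process rewards count only for rollouts whose code actually passes,
    # so the policy cannot farm reasoning scores while failing the tests.
    success_mask = (outcome_rewards > 0).float()
    return outcome_rewards + beta * success_mask * process_rewards

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard GRPO step: normalize rewards within a group of rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts sampled for one prompt.
outcome = torch.tensor([1.0, 0.0, 1.0, 0.0])
process = torch.tensor([0.9, 0.8, 0.4, 0.2])  # rollout 2 "reasons well" but fails
advantages = grpo_advantages(p_grpo_rewards(outcome, process))
```

Note how the second rollout's high reasoning score is ignored because its code fails the tests; under an unconditioned scheme it would still be rewarded, which is exactly the hacking channel P-GRPO closes.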
📝 Abstract
Reinforcement learning (RL) has significantly advanced code generation for large language models (LLMs). However, current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. While supervising the reasoning process directly is a promising direction, it is highly susceptible to reward hacking, where the policy model learns to exploit the reasoning reward signal without improving final outcomes. To address this, we introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL. First, to enable reasoning evaluation, we develop LCB-RB, a benchmark comprising preference pairs of superior and inferior reasoning processes. Second, to accurately score reasoning quality, we introduce an Optimized-Degraded (OD-based) method for reward model training. This method generates high-quality preference pairs by systematically optimizing and degrading initial reasoning paths along curated dimensions of reasoning quality, such as factual accuracy, logical rigor, and coherence. A 7B-parameter reward model trained with this method achieves state-of-the-art (SOTA) performance on LCB-RB and generalizes well to other benchmarks. Finally, we introduce Posterior-GRPO (P-GRPO), a novel RL method that conditions process-based rewards on task success. By selectively applying rewards to the reasoning processes of only successful outcomes, P-GRPO effectively mitigates reward hacking and aligns the model's internal reasoning with final code correctness. A 7B-parameter model trained with P-GRPO achieves superior performance across diverse code generation tasks, outperforming outcome-only baselines by 4.5% and achieving performance comparable to GPT-4-Turbo. We further demonstrate the generalizability of our approach by extending it to mathematical tasks. Our models, dataset, and code are publicly available.
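The OD-based data generation can likewise be sketched in a few lines. Assuming a generic `llm` callable and hypothetical prompt wording (neither is specified in the abstract), each initial reasoning path is rewritten twice per quality dimension, once optimized and once degraded, yielding a chosen/rejected pair for reward model training.

```python
from dataclasses import dataclass

# Curated quality dimensions named in the paper; the prompt text below is assumed.
DIMENSIONS = ["factual accuracy", "logical rigor", "coherence"]

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # optimized reasoning path
    rejected: str  # degraded reasoning path

def build_pair(llm, task_prompt: str, initial_reasoning: str,
               dimension: str) -> PreferencePair:
    """Create one preference pair by rewriting the same reasoning path twice."""
    improved = llm(f"Improve the {dimension} of this reasoning for the task "
                   f"'{task_prompt}', keeping the final answer intact:\n"
                   f"{initial_reasoning}")
    degraded = llm(f"Subtly worsen the {dimension} of this reasoning for the "
                   f"task '{task_prompt}', keeping it superficially plausible:\n"
                   f"{initial_reasoning}")
    return PreferencePair(prompt=task_prompt, chosen=improved, rejected=degraded)

def build_dataset(llm, tasks):
    """tasks: iterable of (task_prompt, initial_reasoning) tuples."""
    return [build_pair(llm, p, r, d)
            for p, r in tasks
            for d in DIMENSIONS]
```

Pairing an optimized and a degraded rewrite of the *same* initial path keeps the two sides topically matched, so the reward model is pushed to discriminate on reasoning quality rather than on surface differences between unrelated solutions.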