Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing reward mechanisms struggle to simultaneously ensure answer correctness and high-quality reasoning: outcome reward models cannot distinguish between superior and inferior reasoning paths, while process reward models are prone to reward hacking. To overcome this, the paper proposes Process-Aware Policy Optimization (PAPO), which introduces a decoupled advantage normalization mechanism within the Group Relative Policy Optimization framework. PAPO normalizes the advantage derived from an Outcome Reward Model (ORM) over all responses in a group, normalizes the advantage derived from a rubric-based Process Reward Model (PRM) only among the correct responses, and then fuses the two into a single training signal. This design maintains answer accuracy while leveraging process feedback without succumbing to reward hacking. Experiments show that PAPO consistently outperforms ORM across six benchmarks, achieving 51.3% accuracy on OlympiadBench (versus 46.3% for ORM) and continuing to improve even when ORM performance plateaus or degrades.
📝 Abstract
We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component Aout, derived from ORM and normalized over all responses, and a process component Aproc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that Aout anchors training on correctness while Aproc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
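The decoupled normalization described in the abstract can be sketched for a single GRPO group as follows. This is a minimal illustration, not the paper's implementation: the fusion rule (additive, with a weight `fuse_weight`) and the exact normalization details are assumptions; the abstract only specifies that Aout is normalized over all responses and Aproc only among correct ones.

```python
import numpy as np

def papo_advantages(outcome_rewards, prm_scores, fuse_weight=0.5, eps=1e-8):
    """Sketch of PAPO-style decoupled advantage normalization for one group.

    outcome_rewards: binary correctness (0/1) from the ORM, one per response.
    prm_scores: rubric-based PRM scores, one per response.
    fuse_weight: hypothetical fusion coefficient (not given in the abstract).
    """
    r = np.asarray(outcome_rewards, dtype=float)
    p = np.asarray(prm_scores, dtype=float)

    # Outcome component Aout: GRPO-style normalization over ALL responses.
    a_out = (r - r.mean()) / (r.std() + eps)

    # Process component Aproc: normalized ONLY among correct responses, so
    # incorrect responses receive no process credit and verbose-but-wrong
    # outputs cannot inflate their advantage (reward-hacking mitigation).
    a_proc = np.zeros_like(p)
    correct = r == 1.0
    if correct.sum() > 1:
        pc = p[correct]
        a_proc[correct] = (pc - pc.mean()) / (pc.std() + eps)

    # Fuse: Aout anchors correctness; Aproc only reorders correct responses.
    return a_out + fuse_weight * a_proc
```

In this sketch, two correct responses with different PRM scores end up with different advantages, while all incorrect responses share the same (negative) advantage, which is the separation of concerns the abstract describes.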
Problem

Research questions and friction points this paper is trying to address.

outcome reward model
process reward model
reward hacking
reasoning quality
advantage signal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Advantage Normalization
Process-Aware Policy Optimization
Reward Hacking Mitigation
Rubric-Based Process Reward
Group Relative Policy Optimization