MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of GRPO-style algorithms: on high-accuracy prompts (already mastered or mostly correct), the training signal is weak or absent, which leads to policy drift, forgetting, and poorly allocated optimization effort. The authors introduce a hinge-KL regularizer, applied only to mastered prompts, to bound harmful policy drift, and a weighting mechanism that prioritizes majority-correct prompts to strengthen consolidation from partial to full mastery. Built on verifiable rewards and group-relative advantage estimation, the method consistently improves pass@1 on three mathematical reasoning benchmarks. It also improves pass@k, suggesting that mastery consolidation promotes solution diversity rather than restricting exploration.

📝 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach to improving the reasoning abilities of Large Language Models (LLMs). Among RLVR algorithms, Group Relative Policy Optimization (GRPO) and its variants have demonstrated strong performance and high training efficiency. However, GRPO-style objectives exhibit two issues on high-accuracy prompts, which comprise mastered prompts (rollout accuracy = 1) and majority-correct prompts (rollout accuracy in (0.5, 1)). For mastered prompts, group-relative advantages vanish, yielding no training signal and leaving policy drift unconstrained, which can cause forgetting. For majority-correct prompts, the induced query weight shrinks as accuracy increases, weakening consolidation from partial correctness to mastery. To alleviate these issues, we propose Mastery-Consolidated Policy Optimization (MCPO), which introduces (i) a hinge-KL regularizer applied exclusively to mastered prompts to bound harmful policy drift between successive gradient steps, and (ii) a weighting mechanism that prioritizes majority-correct prompts to better allocate optimization effort. Extensive experiments across three mathematical benchmarks demonstrate that MCPO consistently improves pass@1 performance. Counter-intuitively, rather than restricting exploration, MCPO also boosts pass@k metrics, indicating that mastery consolidation further catalyzes solution diversity.
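
The two issues described in the abstract can be illustrated numerically. The sketch below is not the authors' implementation. It assumes binary (0/1) verifiable rewards and the common GRPO convention of standardizing rewards within each prompt's rollout group. The function names, the use of mean absolute advantage as a rough proxy for the "induced query weight", and the specific hinge form `max(0, KL - threshold)` are all assumptions made for illustration; the abstract does not give the exact definitions.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group (GRPO-style)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def signal_magnitude(rewards):
    """Mean absolute advantage: an assumed proxy for the per-prompt training signal."""
    advantages = group_relative_advantages(rewards)
    return sum(abs(a) for a in advantages) / len(advantages)


def hinge_kl(kl, threshold):
    """Hypothetical hinge form: penalize only the KL in excess of a threshold."""
    return max(0.0, kl - threshold)


# Three rollout groups of size 8 with different accuracies.
mastered = [1.0] * 8                 # accuracy = 1: every advantage is zero
majority = [1.0] * 7 + [0.0]         # accuracy = 0.875: signal is present but reduced
balanced = [1.0] * 4 + [0.0] * 4     # accuracy = 0.5: signal is largest

for name, group in [("mastered", mastered), ("majority", majority), ("balanced", balanced)]:
    print(name, round(signal_magnitude(group), 4))

# The hinge leaves small drift unpenalized and penalizes only the excess.
print(hinge_kl(0.01, threshold=0.05), hinge_kl(0.08, threshold=0.05))
```

For binary rewards with accuracy p, this proxy works out to roughly 2·sqrt(p(1 − p)), which peaks at p = 0.5, decreases as p approaches 1, and is exactly zero for a mastered prompt. That matches the abstract's description: mastered prompts contribute no gradient signal, and majority-correct prompts contribute less as they approach mastery.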

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards

Group Relative Policy Optimization

policy drift

mastery consolidation

reasoning models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mastery-Consolidated Policy Optimization

hinge-KL regularizer

policy drift