CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

📅 2025-05-18
🤖 AI Summary
Rule-based reinforcement learning (e.g., GRPO, REINFORCE++, RLOO) suffers from training instability and collapse during large language model post-training, driven by excessively large policy updates. To address this, the paper proposes CPGD, a robust optimization framework balancing stability and performance. The contributions are threefold: (1) a theoretically grounded policy drift constraint that dynamically regularizes updates via KL divergence; (2) clipping on the logarithm of the probability ratio to mitigate gradient anomalies; and (3) a PPO-style variant integrating rule-based reward modeling with entropy-regularized policy optimization. Experiments across diverse reasoning tasks demonstrate an average 14.3% improvement over baselines and a 92% reduction in training collapse rate. To the authors' knowledge, this is the first approach achieving both high stability and strong performance for rule-based RL in LM fine-tuning. The implementation is fully open-sourced and reproducible.

📝 Abstract
Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
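The abstract names two mechanisms: clipping the logarithm of the probability ratio, and a KL-divergence policy drift penalty. A minimal sketch of how a loss combining both might look is below; the function name, hyperparameter values, and the exact combination of terms are illustrative assumptions, not the paper's actual objective.

```python
import torch

def cpgd_style_loss(logp_new, logp_old, advantages, kl_coef=0.1, clip_eps=0.2):
    """Illustrative loss: clipped-log-ratio policy gradient plus a KL drift
    penalty. Shapes are (batch, seq_len); values are per-token log-probs
    under the current (logp_new) and behavior (logp_old) policies.
    Hyperparameters are placeholders, not the paper's settings."""
    log_ratio = logp_new - logp_old.detach()
    # Clip the *log* of the ratio rather than the ratio itself, so extreme
    # ratios cannot produce unbounded gradients.
    clipped_log_ratio = torch.clamp(log_ratio, -clip_eps, clip_eps)
    # Pessimistic (min) combination, in the spirit of PPO-style clipping.
    pg_term = torch.min(log_ratio * advantages, clipped_log_ratio * advantages)
    # Policy drift penalty: a non-negative KL estimate (the "k3" estimator)
    # that discourages the new policy from drifting far from the old one.
    ratio = torch.exp(log_ratio)
    kl_est = ratio - 1.0 - log_ratio
    return -(pg_term - kl_coef * kl_est).mean()
```

Note the design point the abstract emphasizes: clamping `log_ratio` bounds the gradient of the surrogate directly, whereas clipping the ratio (as in vanilla PPO) can still pass large gradients through the unclipped branch.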
Problem

Research questions and friction points this paper is trying to address.

Stabilize rule-based RL training for language models
Prevent excessive policy updates causing instability
Improve performance while maintaining training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces KL divergence policy drift constraint
Uses clip mechanism on log ratio
Balances theoretical rigor with usability
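The abstract says the policy drift constraint "dynamically" regularizes updates but does not spell out the schedule. One common scheme for this, borrowed from PPO's adaptive-beta rule and offered here purely as a hypothetical stand-in for whatever CPGD actually uses, adjusts the KL coefficient based on measured drift:

```python
def update_kl_coef(kl_coef, observed_kl, target_kl=0.01, factor=1.5):
    """Hypothetical adaptive KL-coefficient update (modeled on PPO's
    adaptive-beta heuristic, not taken from the CPGD paper).

    Raise the penalty when measured policy drift exceeds the target band,
    lower it when drift is well below the band, else leave it unchanged."""
    if observed_kl > target_kl * 1.5:
        kl_coef *= factor      # drifting too fast: penalize more
    elif observed_kl < target_kl / 1.5:
        kl_coef /= factor      # overly conservative: penalize less
    return kl_coef
```

Called once per training iteration with the batch-averaged KL between the new and old policies, this keeps drift hovering near `target_kl` without hand-tuning a fixed coefficient.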