On-Policy RL with Optimal Reward Baseline

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in reinforcement learning (RL) for aligning large language models (LLMs) and enhancing their reasoning: training instability, arising from loose on-policy constraints, and computational inefficiency, stemming from reliance on auxiliary models. The authors propose a strictly on-policy training paradigm coupled with a theoretically optimal reward baseline estimator. The method eliminates auxiliary models and explicit regularization terms by enforcing exact policy-gradient updates and minimizing gradient variance. Evaluated on mathematical reasoning benchmarks, it significantly improves training stability, task performance, and policy consistency, while increasing output entropy to yield more diverse and less repetitive responses. The core contribution is the first unification of strict on-policy constraints with optimal baseline theory within an LLM RL alignment framework, enabling efficient, stable, and parameter-free end-to-end optimization.
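The "exact on-policy" regime the summary describes means every sample in a batch is drawn from the current policy and exactly one gradient update is taken before resampling, so no importance ratios, clipping, or KL regularizer is needed. A minimal sketch on a toy two-arm bandit (the toy problem and all names here are illustrative, not the paper's code):

```python
import math
import random

def softmax2(theta):
    """Softmax policy over two arms."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def on_policy_step(theta, rng, lr=0.5, batch=256):
    """One strictly on-policy REINFORCE step: sample the whole batch
    from the *current* policy, take the exact policy gradient with a
    baseline, update once. Because the data is never stale, there is
    no importance ratio, clip range, or KL penalty to tune."""
    probs = softmax2(theta)
    actions = [0 if rng.random() < probs[0] else 1 for _ in range(batch)]
    rewards = [float(a) for a in actions]      # arm 1 pays 1, arm 0 pays 0
    b = sum(rewards) / batch                   # simple mean-reward baseline
    grad = [0.0, 0.0]
    for a, r in zip(actions, rewards):
        adv = r - b
        for k in range(2):
            # d/d theta[k] of log softmax: 1{a==k} - probs[k]
            grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * adv / batch
    return [t + lr * g for t, g in zip(theta, grad)]

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    theta = on_policy_step(theta, rng)
# the policy concentrates on the rewarding arm
```

The sketch shows only the on-policy structure; OPO replaces the mean-reward baseline with the variance-minimizing one discussed below under Innovation.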

📝 Abstract
Reinforcement learning algorithms are fundamental to align large language models with human preferences and to enhance their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.
Problem

Research questions and friction points this paper is trying to address.

Addresses training instability in RL algorithms
Reduces computational inefficiency from auxiliary models
Enhances exploration and minimizes gradient variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exact on-policy training for stability
Optimal reward baseline reduces variance
No auxiliary models or regularization needed
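The "optimal reward baseline" bullet refers to the classic variance-minimizing baseline from policy-gradient theory, b* = E[||∇ log π||² R] / E[||∇ log π||²]. A minimal sketch, under the assumption (taken from the paper's reported approximation) that a sampled response's squared gradient norm scales with its token length, so the baseline becomes a length-weighted mean of the group's rewards; function names are illustrative:

```python
def optimal_baseline(rewards, weights):
    # Classic variance-minimizing baseline:
    #   b* = E[w * R] / E[w],  with w = ||grad log pi||^2
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

def opo_advantages(rewards, lengths):
    # Assumption: each response's squared gradient norm is taken as
    # proportional to its token length, so the optimal baseline is a
    # length-weighted average reward over the sampled group.
    b = optimal_baseline(rewards, lengths)
    return [r - b for r in rewards]

# Example: three sampled responses to one prompt,
# with rewards 1/0/0 and token lengths 10/30/20.
advs = opo_advantages(rewards=[1.0, 0.0, 0.0], lengths=[10, 30, 20])
```

Note that the length-weighted sum of the resulting advantages is zero by construction, which is what removes the need for an auxiliary value model or extra regularization terms.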