Policy Improvement Reinforcement Learning

📅 2026-04-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical limitation in existing reinforcement learning post-training methods: the absence of mechanisms to validate the efficacy of policy updates, which often leads to optimization drift or collapse. To mitigate this, the authors propose the PIRL framework, which reframes the optimization objective from immediate reward maximization to cumulative policy improvement across training iterations. Central to this framework is the PIPO algorithm, which introduces a policy improvement feedback mechanism. By employing a sliding window to retrospectively validate updates against historical baselines, PIPO establishes a self-correcting closed-loop optimization process designed to ensure that each policy update contributes positively to final performance. Empirical evaluations on mathematical reasoning benchmarks demonstrate that PIPO achieves superior stability and performance compared to GRPO and its variants.
πŸ“ Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.
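The closed-loop idea described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the objective, update rule, and acceptance test below are hypothetical stand-ins. It only demonstrates the retrospective-verification pattern the abstract describes, where each proposed update is checked against a sliding-window baseline of past scores, then reinforced (kept) if it improves on the baseline and suppressed (rolled back) otherwise.

```python
from collections import deque
import random


def evaluate(policy):
    # Toy stand-in for benchmark performance; peaks at policy == 2.0.
    return -(policy - 2.0) ** 2


def closed_loop_sketch(policy=0.0, iters=200, window=5, lr=0.1, seed=0):
    """Illustrative PIPO-style loop (hypothetical, not the paper's code):
    propose a noisy update, retrospectively verify it against a
    sliding-window baseline of past accepted scores, and keep it only
    if it does not fall below that baseline."""
    rng = random.Random(seed)
    history = deque([evaluate(policy)], maxlen=window)
    for _ in range(iters):
        # Noisy improvement proposal (finite-difference ascent + noise).
        eps = 1e-3
        grad = (evaluate(policy + eps) - evaluate(policy - eps)) / (2 * eps)
        proposal = policy + lr * grad + rng.gauss(0.0, 0.05)

        score = evaluate(proposal)
        baseline = sum(history) / len(history)  # sliding-window baseline
        if score >= baseline:
            policy = proposal       # reinforce: accept beneficial update
            history.append(score)
        # else: suppress the harmful update (keep the previous policy)
    return policy


final_policy = closed_loop_sketch()
```

The key design point mirrored here is that acceptance is judged against a window of recent history rather than only the immediately preceding step, which damps the effect of a single noisy evaluation.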
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Policy Improvement
Reward Verification
Open-loop Optimization
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Improvement
Closed-loop Optimization
Reinforcement Learning
Retrospective Verification
Cumulative Improvement