On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability of policy gradient training in online reinforcement learning (RL) fine-tuning of large language models (LLMs) for reasoning. It systematically investigates the roles of different KL divergence variants (forward vs. reverse, normalized vs. unnormalized) in both gradient estimation and regularization, and proposes a unified Regularized Policy Gradient (RPG) framework compatible with both fully differentiable losses and REINFORCE-style estimators. The analysis characterizes how each KL variant shapes the resulting surrogate losses and gradient estimators, and is validated empirically across diverse LLM reasoning tasks. Experiments show that RPG improves training stability and achieves improved or competitive performance relative to strong baselines, including GRPO, REINFORCE++, and DAPO, on multiple LLM reasoning benchmarks. The implementation is publicly available.
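To make this kind of objective concrete, the following is a minimal PyTorch-style sketch of a KL-regularized policy-gradient loss: a REINFORCE-style policy-gradient term plus a reverse-KL penalty toward a frozen reference policy, using the non-negative "k3" estimator commonly used in GRPO-style training. The tensor names (logp_policy, logp_ref, advantages) and the coefficient beta are illustrative assumptions; the paper derives its own family of surrogate losses, which this sketch does not reproduce.

```python
import torch

def kl_regularized_pg_loss(logp_policy, logp_ref, advantages, beta=0.01):
    """Sketch of a KL-regularized policy-gradient surrogate loss.

    logp_policy: per-token log-probs under the trainable policy (requires grad)
    logp_ref:    per-token log-probs under a frozen reference policy
    advantages:  per-token (or broadcast per-sequence) advantage estimates
    beta:        strength of the KL regularizer (illustrative default)
    """
    # REINFORCE-style policy-gradient term: -E[A * log pi(a|s)].
    pg_term = -(advantages.detach() * logp_policy).mean()

    # Reverse-KL penalty KL(pi || pi_ref) via the "k3" estimator
    # r - 1 - log r with r = pi_ref / pi, computed on on-policy samples;
    # it is non-negative, unlike the raw log-ratio estimate.
    log_ratio = logp_ref - logp_policy
    kl_term = (torch.exp(log_ratio) - 1.0 - log_ratio).mean()

    return pg_term + beta * kl_term
```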

📝 Abstract
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Despite the widespread use of Kullback-Leibler (KL) regularization in policy gradient algorithms to stabilize training, the systematic exploration of how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced and systematically explorable design space. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at https://github.com/complex-reasoning/RPG.
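As a companion to the abstract, here is a small sketch of how the two KL directions between the current policy and a reference policy can be estimated from on-policy samples. The function names and the particular estimators (the plain log-ratio for the reverse KL, importance weighting for the forward KL) are illustrative assumptions and are not taken from the paper.

```python
import torch

def reverse_kl_estimate(logp_policy, logp_ref):
    # KL(pi || pi_ref) from samples drawn from pi:
    # E_pi[log pi - log pi_ref] (the simple log-ratio, or "k1", estimator).
    return (logp_policy - logp_ref).mean()

def forward_kl_estimate(logp_policy, logp_ref):
    # KL(pi_ref || pi) rewritten over samples from pi via importance weighting:
    # E_pi[(pi_ref / pi) * (log pi_ref - log pi)].
    log_ratio = logp_ref - logp_policy
    return (torch.exp(log_ratio) * log_ratio).mean()
```

The direction matters: the reverse KL is mode-seeking and keeps the policy close to the support of the reference model, while the forward KL is mass-covering; the paper studies both directions, for normalized as well as unnormalized policy distributions.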
Problem

Research questions and friction points this paper is trying to address.

Policy gradient training for LLM reasoning can be unstable in the online RL setting
KL regularization is widely used, but how different KL formulations (forward vs. reverse, normalized vs. unnormalized) should be estimated and integrated into surrogate losses has not been systematically explored
Existing methods lack a unified treatment covering both fully differentiable losses and REINFORCE-style gradient estimators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified RPG framework for deriving and analyzing KL-regularized policy gradient methods in online RL
Surrogate losses for both forward and reverse KL regularization, with normalized and unnormalized policy distributions
Fully differentiable loss functions alongside REINFORCE-style gradient estimators (see the sketch below)
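To illustrate the last bullet, the hedged sketch below shows a REINFORCE-style (score-function) surrogate for the reverse-KL regularizer, in which the sampled log-ratio is detached and treated like a per-sample reward; the fully differentiable counterpart is the k3-style penalty sketched after the AI summary above. The function and tensor names are illustrative assumptions, not the paper's implementation.

```python
import torch

def reverse_kl_reinforce_surrogate(logp_policy, logp_ref):
    # REINFORCE-style surrogate for the reverse KL. The sampled log-ratio is
    # detached and acts like a per-sample "reward", so the surrogate's gradient
    # is E_pi[(log pi - log pi_ref) * grad log pi], which matches
    # grad KL(pi || pi_ref) in expectation.
    log_ratio = (logp_policy - logp_ref).detach()
    return (log_ratio * logp_policy).mean()
```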