🤖 AI Summary
This work addresses the instability of policy gradient training in online reinforcement learning (RL) for improving the reasoning of large language models (LLMs). We systematically investigate how different KL divergence variants (forward vs. reverse, normalized vs. unnormalized) can be estimated and used as regularizers in surrogate losses. We propose Regularized Policy Gradient (RPG), a unified framework that covers both fully differentiable losses and REINFORCE-style gradient estimators. For each KL variant we derive the policy gradient and a matching surrogate loss, and we evaluate the resulting methods on RL for LLM reasoning. Experiments show improved or competitive training stability and performance compared to strong baselines, including GRPO, REINFORCE++, and DAPO. The implementation is publicly available.
📝 Abstract
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Although Kullback-Leibler (KL) regularization is widely used in policy gradient algorithms to stabilize training, how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced design space that can be explored systematically. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at https://github.com/complex-reasoning/RPG.
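To make the KL variants concrete, the following is a minimal sketch on a toy softmax policy over a handful of discrete actions. It is not the RPG implementation from the repository. The function names, the toy rewards, and the convention that "reverse KL" places the trained policy first, KL(pi || ref), are all assumptions made for illustration. The sketch compares the analytic gradient of a reverse-KL-regularized objective with a finite-difference estimate.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """Normalized KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def unnormalized_kl(p, q):
    """Generalized KL for non-negative measures: sum of p*log(p/q) - p + q.
    It reduces to kl(p, q) when both p and q sum to one."""
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))

def objective(logits, rewards, ref, beta):
    """Expected reward minus beta * KL(pi || ref) (assumed 'reverse' direction)."""
    pi = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl(pi, ref)

def analytic_grad(logits, rewards, ref, beta):
    """Exact gradient of `objective` with respect to the logits.
    The KL term acts as a per-action reward penalty beta * log(pi / ref),
    and the policy-weighted mean of the shaped reward acts as a baseline."""
    pi = softmax(logits)
    shaped = [r - beta * math.log(p / q) for r, p, q in zip(rewards, pi, ref)]
    baseline = sum(p * s for p, s in zip(pi, shaped))
    return [p * (s - baseline) for p, s in zip(pi, shaped)]

def numeric_grad(logits, rewards, ref, beta, eps=1e-6):
    """Central finite differences, used only to check analytic_grad."""
    grads = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        diff = objective(up, rewards, ref, beta) - objective(down, rewards, ref, beta)
        grads.append(diff / (2 * eps))
    return grads

# Hypothetical toy problem: three actions, a uniform reference policy.
LOGITS = [0.2, -0.5, 1.0]
REWARDS = [1.0, 0.0, 0.5]
REF = [1 / 3, 1 / 3, 1 / 3]
BETA = 0.1

if __name__ == "__main__":
    pi = softmax(LOGITS)
    print("KL(pi || ref):", kl(pi, REF))
    print("KL(ref || pi):", kl(REF, pi))
    print("analytic grad:", analytic_grad(LOGITS, REWARDS, REF, BETA))
    print("numeric grad: ", numeric_grad(LOGITS, REWARDS, REF, BETA))
```

The two KL directions generally give different values for the same pair of distributions, which is why the choice of direction changes the regularizer. In an LLM setting the exact sum over actions would be replaced by samples from the policy, which is where REINFORCE-style estimators come in.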