🤖 AI Summary
This paper systematically identifies a gradient bias issue arising from how KL regularization estimators are incorporated into reinforcement learning (RL) objectives for training large language models (LLMs): existing KL divergence estimators lack theoretical guarantees when placed directly in the loss, causing a misalignment between the stated optimization target and the gradients actually computed. To address this, the authors propose a principle for configuring estimators so that their gradients are unbiased, validated empirically across on-policy and off-policy RL settings using Qwen2.5-7B, Llama-3.1-8B-Instruct, and Qwen3-4B-Instruct-2507. Their experiments show that biased-gradient configurations can destabilize training, while unbiased KL estimation improves both in-domain and out-of-domain generalization. Moreover, in asynchronous off-policy RL settings, KL regularization is shown to play a stabilizing role, suppressing gradient oscillations and improving convergence robustness. These findings offer foundational insights for designing theoretically sound and empirically effective RLHF algorithms.
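The gradient mismatch can be seen in a toy categorical example. Below is a minimal NumPy sketch (illustrative only, not the paper's implementation; the distributions are made up): when the common `k1` estimator, log pi(x) - log pi_ref(x), is dropped directly into the loss, autodiff only flows through the log pi(x) term, and the resulting expected gradient is exactly zero — pure noise, no regularization signal — whereas a score-function surrogate recovers the true KL gradient.

```python
import numpy as np

# Toy setting: categorical policy pi = softmax(theta), fixed reference q.
theta = np.array([0.2, -0.1, 0.4])
q = np.array([0.3, 0.5, 0.2])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)
kl = np.sum(pi * np.log(pi / q))  # exact reverse KL(pi || q)

# Exact gradient of KL w.r.t. theta (analytic, for softmax):
#   dKL/dtheta_j = pi_j * (log(pi_j / q_j) - KL)
true_grad = pi * (np.log(pi / q) - kl)

# Jacobian of log pi: row i holds the gradient of log pi_i w.r.t. theta,
# i.e. d(log pi_i)/d(theta_j) = delta_ij - pi_j.
grad_logpi = np.eye(3) - pi

# Naive configuration: use k1(x) = log pi(x) - log q(x) as the loss and let
# autodiff see only the log pi(x) term. Its expected gradient is
#   E_x[grad log pi(x)] = sum_i pi_i * (e_i - pi) = 0.
naive_k1_grad = pi @ grad_logpi

# Unbiased configuration: score-function surrogate detach(k1) * log pi(x).
# Its expected gradient E_x[k1(x) * grad log pi(x)] matches true_grad, since
# the missing pathwise term E_x[grad k1(x)] vanishes in expectation.
k1 = np.log(pi / q)
unbiased_grad = (pi * k1) @ grad_logpi

print(true_grad, naive_k1_grad, unbiased_grad)
```

The expectations here are computed exactly (summing over categories) rather than by sampling, so the zero-gradient pathology of the naive configuration is visible without Monte Carlo noise.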
📝 Abstract
The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term: the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite their wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimator configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning `Qwen2.5-7B`, `Llama-3.1-8B-Instruct`, and `Qwen3-4B-Instruct-2507` with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) estimator configurations with unbiased gradients lead to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance of different KL configurations in off-policy settings and observe that KL regularization can help stabilize off-policy RL training arising in asynchronous setups.
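To make the estimators concrete, the three Monte Carlo estimators most commonly used for the reverse KL (usually called k1, k2, and k3) can be compared on a toy categorical pair. This is an illustrative sketch under made-up distributions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the trained policy pi and the reference policy pi_ref.
pi = np.array([0.5, 0.3, 0.2])
pi_ref = np.array([0.36, 0.24, 0.4])
true_kl = np.sum(pi * np.log(pi / pi_ref))  # exact reverse KL(pi || pi_ref)

# Draw on-policy samples x ~ pi, as done during RL fine-tuning.
x = rng.choice(len(pi), size=200_000, p=pi)
log_r = np.log(pi_ref[x] / pi[x])  # log of the ratio r = pi_ref(x) / pi(x)
r = np.exp(log_r)

k1 = -log_r              # unbiased in value, high variance, can go negative
k2 = 0.5 * log_r ** 2    # biased in value, low variance, always non-negative
k3 = (r - 1.0) - log_r   # unbiased in value, low variance, always non-negative

print(true_kl, k1.mean(), k2.mean(), k3.mean())
```

Note that the value-level bias of k2 is a separate question from the gradient-level bias the paper studies: even an estimator that is unbiased in value can yield biased gradients once it is differentiated directly as a loss term.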