Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability and low sample efficiency of existing reinforcement-learning fine-tuning methods for large language models (LLMs), which often discard high-return but high-divergence samples due to hard clipping of policy ratios. To overcome this limitation, we propose R²VPO, a novel framework that introduces the variance of the policy ratio as a regularization term in policy optimization, replacing the rigid clipping mechanism with a smooth trust-region objective. R²VPO further integrates primal-dual gradient optimization and dynamic reweighting of stale data to improve training stability and efficiency. Experiments on mathematical reasoning tasks demonstrate that R²VPO achieves up to a 17% performance improvement over baseline methods while reducing the required number of rollouts by approximately 50%.
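The shift from hard clipping to a variance constraint can be sketched as follows. This is an illustrative form based on the summary above, not the paper's exact objective; the multiplier $\lambda$ and radius $\delta$ are assumed notation.

```latex
% PPO-style hard-clip surrogate, with ratio r_\theta = \pi_\theta(a\mid s)/\pi_{\mathrm{old}}(a\mid s):
\mathcal{L}_{\mathrm{clip}}(\theta)
  = \mathbb{E}\!\left[\min\bigl(r_\theta A,\ \mathrm{clip}(r_\theta,\,1-\epsilon,\,1+\epsilon)\,A\bigr)\right]

% Smooth relaxation: constrain the second central moment of the ratio instead.
% Since \mathbb{E}_{\pi_{\mathrm{old}}}[r_\theta] = 1, we have
% \mathrm{Var}_{\pi_{\mathrm{old}}}(r_\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\bigl[(r_\theta - 1)^2\bigr].
\max_\theta\ \mathbb{E}[r_\theta A]
  \quad \text{s.t.} \quad \mathbb{E}\!\bigl[(r_\theta - 1)^2\bigr] \le \delta

% Primal-dual (Lagrangian) form, updated jointly in \theta and \lambda \ge 0:
\mathcal{L}(\theta, \lambda)
  = \mathbb{E}[r_\theta A] \;-\; \lambda\,\bigl(\mathbb{E}\!\bigl[(r_\theta - 1)^2\bigr] - \delta\bigr)
```

Unlike the clipped surrogate, whose gradient vanishes entirely once the ratio leaves the trust interval, the variance penalty shrinks but never zeroes the contribution of high-divergence samples.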

📝 Abstract
On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping stabilizes training, this heuristic hard constraint incurs a fundamental cost: it indiscriminately truncates gradients from high-return yet high-divergence actions, suppressing rare but highly informative "eureka moments" in complex reasoning. Moreover, once data becomes slightly stale, hard clipping renders it unusable, leading to severe sample inefficiency. In this work, we revisit the trust-region objective in policy optimization and show that explicitly constraining the \emph{variance (second central moment) of the policy ratio} provides a principled and smooth relaxation of hard clipping. This distributional constraint stabilizes policy updates while preserving gradient signals from valuable trajectories. Building on this insight, we propose $R^2VPO$ (Ratio-Variance Regularized Policy Optimization), a novel primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse by dynamically reweighting stale samples rather than discarding them. We extensively evaluate $R^2VPO$ on fine-tuning state-of-the-art LLMs, including DeepSeek-Distill-Qwen-1.5B and the openPangu-Embedded series (1B and 7B), across challenging mathematical reasoning benchmarks. Experimental results show that $R^2VPO$ consistently achieves superior asymptotic performance, with average relative gains of up to 17% over strong clipping-based baselines, while requiring approximately 50% fewer rollouts to reach convergence. These findings establish ratio-variance control as a promising direction for improving both stability and data efficiency in RL-based LLM alignment.
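The core contrast the abstract draws — hard clipping zeroing out gradients from high-divergence samples versus a smooth variance penalty that merely discounts them — can be made concrete with a minimal sketch. The functions, the penalty weight `lam`, and the per-sample formulation below are illustrative assumptions, not the paper's implementation.

```python
def ppo_clip_surrogate(ratio, adv, eps=0.2):
    # Standard PPO surrogate: once the ratio leaves [1 - eps, 1 + eps]
    # (in the direction favored by the advantage), the min() selects the
    # clipped, constant branch and the gradient w.r.t. the policy vanishes.
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * adv, clipped_ratio * adv)

def ratio_variance_objective(ratios, advs, lam=0.5):
    # Smooth alternative (sketch): keep the plain importance-weighted
    # surrogate, but subtract a penalty on the second central moment of
    # the ratios. Since E[r] = 1 under the behavior policy, the variance
    # is estimated as mean((r - 1)^2). High-divergence samples are
    # penalized continuously instead of being discarded outright.
    n = len(ratios)
    surrogate = sum(r * a for r, a in zip(ratios, advs)) / n
    var_penalty = sum((r - 1.0) ** 2 for r in ratios) / n
    return surrogate - lam * var_penalty
```

With `eps=0.2`, a sample at `ratio=2.0` with positive advantage contributes the same capped value as one at `ratio=1.2` under clipping, whereas the variance-regularized objective still distinguishes them through the smooth `(r - 1)^2` term.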
Problem

Research questions and friction points this paper is trying to address.

policy optimization
sample inefficiency
policy ratio clipping
large language models
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

ratio-variance regularization
policy optimization
sample efficiency
off-policy reuse
trust-region method