Beyond Importance Sampling: Rejection-Gated Policy Optimization

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability of policy optimization methods that rely on importance sampling, whose ratios can be heavy-tailed and yield high-variance gradients. The authors propose Rejection-Gated Policy Optimization (RGPO), which elevates rejection from a data-level heuristic to a core optimization principle: a differentiable acceptance gate selects trustworthy samples and replaces the importance sampling weights. RGPO unifies TRPO, PPO, and REINFORCE within a single framework; its smooth gating function ensures bounded gradient variance and yields an approximate monotonic improvement guarantee. For online preference fine-tuning (RLHF-style alignment), a dual-ratio gate anchors learning to both the previous policy and the reference model. On Qwen2.5-1.5B-Instruct fine-tuned on Anthropic HH-RLHF, RGPO achieves 14.8% higher reward and 16.0% lower KL divergence to the reference model than PPO-RLHF, a Pareto-dominant outcome among the online RL methods compared.
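The bounded-variance claim can be illustrated with a small simulation. Differentiating a gate g(r) through the ratio gives a per-sample gradient weight g'(r) * r by the chain rule, so a gate whose derivative decays quickly keeps that weight bounded even when r itself is heavy-tailed. The sketch below is only an illustration under stated assumptions: the gate g(r) = tanh(r), the Pareto ratio distribution, and all function names are hypothetical choices made here, not the gate or the setup used in the paper.

```python
import math
import random


def gated_weight(r):
    """Effective weight g'(r) * r for the assumed gate g(r) = tanh(r)."""
    # d/dr tanh(r) = 1 - tanh(r)^2, which decays exponentially in r.
    return (1.0 - math.tanh(r) ** 2) * r


def sample_heavy_tailed_ratio(rng, tail_index=1.5):
    """Draw a ratio with mean 1 and infinite variance (assumed Pareto model)."""
    u = 1.0 - rng.random()                   # uniform on (0, 1]
    pareto = u ** (-1.0 / tail_index)        # inverse-CDF sample of Pareto(x_m = 1)
    mean = tail_index / (tail_index - 1.0)   # finite because tail_index > 1
    return pareto / mean


def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    ratios = [sample_heavy_tailed_ratio(rng) for _ in range(10_000)]
    gated = [gated_weight(r) for r in ratios]
    print("largest raw ratio   :", max(ratios))
    print("largest gated weight:", max(gated))
    print("sample variance, raw ratios   :", sample_variance(ratios))
    print("sample variance, gated weights:", sample_variance(gated))
```

With this gate the weight r * (1 - tanh(r)^2) never exceeds roughly 0.45, so its variance is finite regardless of the tail of r, whereas the sample variance of the raw ratios keeps growing as more samples are drawn.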

📝 Abstract

We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_theta = pi_theta / pi_old with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) * r. We prove that RGPO guarantees finite, bounded gradient variance even when importance sampling ratios are heavy-tailed (where IS variance diverges). We further show that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PPO in computational cost, requires no second-order optimization, and extends naturally to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF (n = 3 seeds), RGPO uses a dual-ratio gate that anchors learning to both the previous policy and the reference model, achieving a Pareto-dominant outcome: the highest reward among online RL methods (+14.8% vs. PPO-RLHF) and the lowest KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
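The effective-weight view follows from the chain rule: the gradient of g(r_theta) is g'(r_theta) * r_theta times the gradient of log pi_theta, which gives w(r) = g'(r) * r. The sketch below checks this numerically for three illustrative choices of g. These choices are assumptions made here and are not confirmed by the abstract: g(r) = r (the plain ratio surrogate, giving w = r), a clipped ratio with a hypothetical eps = 0.2 (a PPO-clip-like weight that ignores the advantage sign), and g(r) = log r (giving the constant weight w = 1 of a score-function estimator). Note that log r does not lie in [0, 1]; it is included only to show how w behaves, not as a valid gate.

```python
import math


def effective_weight(g, r, h=1e-6):
    """Approximate w(r) = g'(r) * r using a central finite difference."""
    g_prime = (g(r + h) - g(r - h)) / (2.0 * h)
    return g_prime * r


def g_identity(r):
    """Plain importance ratio: g(r) = r, so w(r) = r."""
    return r


def g_log(r):
    """Log ratio: g(r) = log r, so w(r) = 1 (REINFORCE-like weight)."""
    return math.log(r)


def make_g_clip(eps=0.2):
    """Clipped ratio: w(r) = r inside [1 - eps, 1 + eps] and 0 outside."""
    def g_clip(r):
        return min(max(r, 1.0 - eps), 1.0 + eps)
    return g_clip


if __name__ == "__main__":
    g_clip = make_g_clip(0.2)
    for r in (0.5, 0.9, 1.1, 2.0):
        print(
            f"r={r:<4}",
            f"identity={effective_weight(g_identity, r):.3f}",
            f"clip={effective_weight(g_clip, r):.3f}",
            f"log={effective_weight(g_log, r):.3f}",
        )
```

Under these assumptions the table makes the contrast concrete: the identity weight grows without bound in r, the clipped weight drops abruptly to zero outside the trust interval, and the log weight ignores r entirely. A smooth gate would sit between these extremes.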

Problem

Research questions and friction points this paper is trying to address.

importance sampling

policy optimization

gradient variance

preference alignment

heavy-tailed ratios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rejection-Gated Policy Optimization

importance sampling

differentiable gating