Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe distributional mismatch that arises when rollout generation is decoupled from policy optimization in reinforcement learning for large language models, a mismatch that often destabilizes training. To mitigate this issue, the authors propose the Jackpot framework, which introduces an Optimal Budgeted Rejection Sampling (OBRS) mechanism to reduce the distributional gap between the rollout model and the target policy under a constrained computational budget. Jackpot integrates top-k probability estimation, batch-level bias correction, and a unified training objective to enable stable and efficient decoupled reinforcement learning. Experiments demonstrate that, on Qwen3-8B-Base, Jackpot matches the performance of on-policy RL after only 300 training steps, significantly outperforms importance-sampling baselines, and substantially improves training stability.

📝 Abstract
Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation is costly. Decoupling rollout generation from policy optimization (e.g., leveraging a more efficient model for rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with batch size 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.
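To make the core mechanism concrete, the sketch below shows budgeted rejection sampling over a discrete support, in the spirit of the OBRS idea described in the abstract: a draw from the rollout distribution q is accepted with probability min(1, c · p(x)/q(x)), where c is tuned so the expected acceptance rate meets a budget. This is a minimal illustration with made-up toy distributions, not the paper's implementation; the function name, the bisection routine, and all numbers are assumptions.

```python
import numpy as np

def obrs_distribution(p, q, budget, tol=1e-10):
    """Budgeted rejection sampling sketch over a discrete support.

    Accept x ~ q with probability min(1, c * p[x] / q[x]), with c chosen
    by bisection so that the expected acceptance rate equals `budget`.
    Returns the normalized accepted distribution and the constant c.
    Illustrative only -- not the paper's actual procedure.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    def acceptance_rate(c):
        # E_q[min(1, c * p/q)] = sum_x min(q[x], c * p[x])
        return np.sum(np.minimum(q, c * p))

    lo, hi = 0.0, 1.0
    while acceptance_rate(hi) < budget:  # grow hi until the budget is reachable
        hi *= 2.0
    while hi - lo > tol:  # acceptance rate is monotone non-decreasing in c
        mid = 0.5 * (lo + hi)
        if acceptance_rate(mid) < budget:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    accepted = np.minimum(q, c * p)      # unnormalized accepted mass
    return accepted / accepted.sum(), c

# Toy check: the accepted distribution lands closer to the target p
# than the mismatched rollout distribution q was.
p = np.array([0.7, 0.2, 0.1])            # target policy (toy)
q = np.array([0.2, 0.3, 0.5])            # rollout model (toy, mismatched)
r, c = obrs_distribution(p, q, budget=0.5)
tv = lambda a, b: 0.5 * np.abs(a - b).sum()
assert tv(r, p) <= tv(q, p)
```

The budget controls the trade-off the abstract alludes to: a smaller acceptance budget discards more rollout samples but pulls the accepted distribution closer to the target policy, while a budget of 1 accepts everything and leaves q unchanged.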
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
large language models
distribution mismatch
rollout generation
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal Budgeted Rejection Sampling
Distribution Mismatch
Decoupled Rollout
Reinforcement Learning for LLMs
Policy Alignment