🤖 AI Summary
This work addresses the sampling inefficiency caused by fixed rollout allocation in online reinforcement learning, particularly in settings with verifiable rewards. The authors propose a variance-informed predictive rollout allocation strategy that uses a lightweight Gaussian process to model and predict the success probabilities of individual prompts. Under a computational budget constraint, rollout resources are dynamically allocated via convex optimization to minimize the variance of the policy gradient. This approach is presented as the first to combine gradient-variance minimization with an explicit budget constraint, overcoming the limitations of conventional uniform or heuristic allocation schemes. Experimental results demonstrate that the method significantly improves both sample efficiency and final performance across multiple benchmark tasks, outperforming existing allocation strategies.
📝 Abstract
Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts to every training prompt. This uniform allocation implicitly treats all prompts as equally informative, which can waste computational budget and impede training progress. We introduce VIP, a Variance-Informed Predictive allocation strategy that distributes a given rollout budget over the prompts in the current batch so as to minimize the expected gradient variance of the policy update. At each iteration, VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities from recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem that determines the optimal rollout allocation under a hard compute budget constraint. Empirical results show that VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies across multiple benchmarks.
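To make the allocation step concrete, here is a minimal sketch of the core idea under simplifying assumptions. With a binary verifiable reward, the per-rollout reward variance of a prompt with success probability p is p(1 − p); minimizing the summed variance contribution Σᵢ σᵢ²/nᵢ subject to Σᵢ nᵢ = B admits the closed-form Neyman-style rule nᵢ ∝ σᵢ. The paper's actual method fits a Gaussian process and solves a convex program, neither of which is reproduced here; the function name `allocate_rollouts` and the integer-rounding scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def allocate_rollouts(p_hat, budget, min_rollouts=1):
    """Split a rollout budget across prompts to reduce gradient variance.

    For a binary reward, prompt i has per-rollout variance p_i * (1 - p_i).
    Minimizing sum_i var_i / n_i subject to sum_i n_i = budget yields the
    Neyman-style allocation n_i proportional to sqrt(var_i).
    """
    p_hat = np.asarray(p_hat, dtype=float)
    sigma = np.sqrt(p_hat * (1.0 - p_hat))  # per-prompt reward std-dev
    if sigma.sum() == 0.0:
        # All prompts are (predicted) certain: fall back to uniform weights.
        weights = np.full(len(p_hat), 1.0 / len(p_hat))
    else:
        weights = sigma / sigma.sum()
    # Guarantee a floor of min_rollouts per prompt, then spread the rest.
    n = np.full(len(p_hat), min_rollouts)
    spare = budget - n.sum()
    n = n + np.floor(weights * spare).astype(int)
    # Hand any rounding remainder to the highest-variance prompts.
    remainder = budget - n.sum()
    order = np.argsort(-sigma)
    n[order[:remainder]] += 1
    return n

# Prompts near 50% predicted success are most informative and get more rollouts.
n = allocate_rollouts([0.5, 0.9, 0.99], budget=16)
```

Note that prompts the model is predicted to always solve (or always fail) contribute zero gradient under group-normalized advantages, which is exactly why a variance-aware rule starves them of rollouts.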