$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high variance in policy gradient baselines under sparse-reward reinforcement learning. The authors propose the $V_{0.5}$ method, which adaptively fuses a general value model’s predictions—used as a prior—with the empirical mean from sparse rollouts. A real-time statistical hypothesis test dynamically evaluates the reliability of the prior and adjusts the rollout budget accordingly. This approach substantially reduces the mean squared error of baseline estimates and remains stable even under extreme sparsity, with as few as four samples per batch. Evaluated on six mathematical reasoning benchmarks, $V_{0.5}$ demonstrates faster convergence and over 10% performance improvement compared to GRPO and DAPO, confirming its efficiency and robustness.

📝 Abstract
In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to update the value model synchronously alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts, constructing a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing with dynamic budget allocation, which balances the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test to evaluate the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's Mean Squared Error (MSE), guaranteeing stable policy gradients even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and over 10% performance improvement.
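The fusion idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the function name, the two-sided z-test (as a stand-in for the paper's hypothesis test), and the inverse-variance weighting used to blend prior and empirical mean are all illustrative assumptions.

```python
import math

def fused_baseline(rewards, v_prior, prior_var,
                   z_crit=1.96, extra_budget=4, sample_fn=None):
    """Illustrative fusion of a value-model prior with a sparse-rollout mean.

    A z-test checks whether the prior is consistent with the rollouts;
    if it is rejected and a sampler is available, extra rollout budget is
    spent. An inverse-variance weighting (which minimizes the MSE of the
    combined estimator when both estimates are unbiased) then blends the
    prior with the empirical mean.
    """
    def stats(xs):
        n = len(xs)
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / max(n - 1, 1)
        return n, mean, var

    n, mean, var = stats(rewards)
    se = math.sqrt(var / n) + 1e-8          # guard against zero variance
    if abs(mean - v_prior) / se > z_crit and sample_fn is not None:
        # Prior looks unreliable on this prompt: allocate extra rollouts.
        rewards = list(rewards) + [sample_fn() for _ in range(extra_budget)]
        n, mean, var = stats(rewards)

    # Trust the prior more when the rollout mean is noisy, and vice versa.
    w_prior = (var / n) / (var / n + prior_var + 1e-12)
    return w_prior * v_prior + (1.0 - w_prior) * mean
```

For example, with rewards `[1, 0, 1, 0]` and a prior of 0.5, the prior and empirical mean agree, so the fused baseline is 0.5 regardless of the weighting; the two estimates only pull apart (and trigger extra sampling) when the prior is biased.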
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
value estimation
sparse rollouts
advantage baseline
policy gradients
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalist Value Model
Sparse RL Rollouts
Adaptive Baseline Estimation
Dynamic Budget Allocation
Real-time Hypothesis Testing