🤖 AI Summary
This work addresses the inefficiency of fixed rollout allocation in group-based policy optimization, which wastes compute on prompts whose sampled groups are already saturated and under-explores prompts where additional samples might reveal correct trajectories. The authors propose HORA (Hit-Utility Optimal Rollout Allocation), a learning-free adaptive allocation policy built on “hit utility”: the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. HORA reallocates the rollout budget to maximize total posterior hit utility within each allocation batch while leaving reward evaluation and the group-based advantage estimator unchanged, so it works as a drop-in with estimators such as GRPO and RLOO. Across four mathematical reasoning benchmarks and three model scales (twelve model-benchmark configurations), HORA preserves comparable Pass@1 and improves Pass@K over compute-matched GRPO in ten configurations, with one tie and one saturated exception.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving the reasoning capabilities of large language models. Group-based policy optimization methods, such as GRPO, typically allocate a fixed number of rollouts to every prompt. This uniform allocation can be inefficient: it over-allocates compute to prompts whose sampled groups are already saturated while under-exploring prompts for which additional samples may reveal useful correct trajectories. To address this limitation, we introduce hit utility, the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. Building on this notion, we propose Hit-Utility Optimal Rollout Allocation (HORA), a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. HORA adaptively reallocates rollout budgets while leaving the downstream reward evaluation and group-based advantage estimator unchanged. Across four mathematical reasoning benchmarks and three model scales, HORA preserves comparable Pass@1 and improves Pass@K over compute-matched GRPO in ten of twelve model-benchmark configurations, with one tie and one saturated exception. It is also drop-in compatible with other group-based estimators such as RLOO. Ablation studies indicate that the uniform prior used by HORA is competitive with five prompt-conditioned learned-prior alternatives.
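To make the definition of hit utility concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Beta-Bernoulli model of each prompt's success rate with a uniform Beta(1, 1) prior, so that after observing s correct and f incorrect rollouts the probability that at least one of m further rollouts is correct is 1 - prod_{i=0}^{m-1} (b + i) / (a + b + i), with a = 1 + s and b = 1 + f. The function names `hit_utility` and `allocate`, the greedy one-rollout-at-a-time scheme, and the example counts are all illustrative assumptions. The abstract does not say how HORA treats prompts that already contain correct rollouts, and this sketch ranks purely by posterior hit probability.

```python
from fractions import Fraction
import heapq


def hit_utility(successes, failures, m, alpha=1, beta=1):
    """Posterior probability that at least one of m extra rollouts is correct.

    Assumes a Beta(alpha, beta) prior on the prompt's success rate
    (uniform by default) and independent Bernoulli rollouts.
    """
    a = alpha + successes
    b = beta + failures
    miss = Fraction(1)  # probability that all m extra rollouts are wrong
    for i in range(m):
        miss *= Fraction(b + i, a + b + i)
    return 1 - miss


def allocate(counts, budget):
    """Greedily assign `budget` extra rollouts across prompts.

    counts: list of (successes, failures) observed so far per prompt.
    Each step gives one rollout to the prompt with the largest marginal
    gain in hit utility. Marginal gains shrink as a prompt receives more
    rollouts, so under this model the greedy choice maximizes the summed
    hit utility for the given budget.
    """
    alloc = [0] * len(counts)
    heap = []
    for idx, (s, f) in enumerate(counts):
        gain = hit_utility(s, f, 1) - hit_utility(s, f, 0)
        heapq.heappush(heap, (-gain, idx))  # max-heap via negated gain
    for _ in range(budget):
        _, idx = heapq.heappop(heap)
        alloc[idx] += 1
        s, f = counts[idx]
        m = alloc[idx]
        gain = hit_utility(s, f, m + 1) - hit_utility(s, f, m)
        heapq.heappush(heap, (-gain, idx))
    return alloc
```

For example, under these assumptions a prompt with 0 correct out of 2 rollouts has hit utility 1/4 for one extra rollout, while a prompt with 0 correct out of 8 has hit utility 1/10, so the first extra rollouts go to the former.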