Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

📅 2026-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost and weak learning signals in existing Reinforcement Learning with Verifiable Rewards (RLVR) methods such as GRPO and DAPO, which stem from their reliance on extensive rollouts and highly skewed correctness distributions that yield low intra-group reward variance. To overcome these limitations, the paper proposes arrol, the first RLVR framework to incorporate online rollout pruning during generation. arrol employs a lightweight quality head to dynamically predict the success probability of partial rollouts, enabling early pruning while keeping the surviving set balanced between correct and incorrect outcomes. The pruning mechanism is integrated directly into the inference engine and complemented by dynamic re-batching for log-probability computation and policy-gradient updates, as well as test-time scaling with quality-weighted aggregation. Evaluated on Qwen-3 and LLaMA-3.2 models (1B–8B), arrol achieves up to 1.7× faster training and improves average accuracy by 2.30–2.99 points, with test-time scaling yielding further gains of up to 8.33 points.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones toward a more correctness-balanced mix to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B–8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yields up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.
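The pruning loop the abstract describes can be sketched minimally as below. This is an illustrative toy, not the paper's implementation: `quality_head` here is a stand-in heuristic rather than the learned predictor, and all names and parameters (`keep`, `check_at`) are assumptions for the sketch.

```python
import random

random.seed(0)

def quality_head(partial_tokens):
    # Hypothetical stand-in for arrol's learned quality head: scores a
    # partial rollout by the fraction of "good" steps so far. In the paper
    # this would be a lightweight learned predictor of success probability.
    if not partial_tokens:
        return 0.5
    return sum(partial_tokens) / len(partial_tokens)

def generate_with_pruning(n_rollouts=8, max_steps=16, keep=4, check_at=8):
    """Sample n_rollouts in parallel; at step `check_at`, keep only the
    `keep` partial rollouts the quality head scores highest."""
    rollouts = [[] for _ in range(n_rollouts)]
    for step in range(max_steps):
        for r in rollouts:
            # Toy "token": a boolean marking whether this step looked good.
            r.append(random.random() < 0.6)
        if step + 1 == check_at:
            # Early pruning: drop low-scoring partial rollouts mid-generation,
            # so they never consume the remaining decoding budget.
            scored = sorted(rollouts, key=quality_head, reverse=True)
            rollouts = scored[:keep]
    return rollouts

survivors = generate_with_pruning()
print(len(survivors))  # 4 rollouts survive to full length
```

The surviving rollouts would then be re-batched for log-probability computation and the policy update, as the system design in the abstract describes.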
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
rollout pruning
computational cost
reward sparsity
learning signal
Innovation

Methods, ideas, or system contributions that make the work stand out.

online rollout pruning
Reinforcement Learning with Verifiable Rewards
quality head
training acceleration
test-time scaling
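The test-time scaling use of the quality head can be illustrated with a small sketch: instead of plain majority voting over sampled answers, each candidate is weighted by its quality-head score. The function name and score values below are illustrative assumptions, not the paper's API.

```python
from collections import defaultdict

def weighted_aggregate(candidates):
    """candidates: list of (answer, quality_score) pairs.
    Sum quality-head scores per distinct answer and return the answer
    with the highest total weight (quality-weighted voting)."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# "41" appears twice but with low scores; the high-confidence "42" wins.
print(weighted_aggregate([("42", 0.9), ("41", 0.4), ("41", 0.3), ("42", 0.2)]))
# → 42
```

Under plain majority voting this example would be a tie; weighting by the quality head breaks ties toward candidates the head rates as more likely to be correct.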