🤖 AI Summary
In reinforcement learning with verifiable rewards (RLVR), the Group Relative Policy Optimization (GRPO) algorithm suffers from degenerate advantage estimation and poor sampling efficiency when rewards are homogeneous within a group, i.e., when all responses to a prompt receive the same reward. To address this, we propose Adaptive Rollout and Response Reuse Policy Optimization (AR3PO). The method introduces (1) a prompt-difficulty-aware adaptive rollout mechanism that dynamically allocates generation budget across prompts, and (2) a cache of previously generated correct responses that is reused to supply policy updates with stable, high signal-to-noise learning signals. Built on the GRPO framework, AR3PO incorporates normalized reward computation and dynamic response scheduling. Empirical evaluation on 7B–32B language models shows that AR3PO matches or exceeds DAPO's performance while reducing rollout cost by up to 4.2×, substantially improving training efficiency and sample utilization.
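To make the two mechanisms concrete, here is a minimal Python sketch, assuming a binary verifiable reward. The helper names `generate_responses` and `verify`, the linear budget schedule, and the cache policy are illustrative placeholders, not the paper's exact procedure.

```python
import random
from collections import defaultdict

MIN_ROLLOUTS, MAX_ROLLOUTS = 4, 16  # illustrative budget bounds

# Cache of previously generated correct responses, keyed by prompt.
correct_cache: dict[str, list[str]] = defaultdict(list)

def adaptive_rollout(prompt: str, solve_rate: float,
                     generate_responses, verify) -> list[tuple[str, float]]:
    """Sample more responses for hard prompts (low estimated solve rate)
    and fewer for easy ones, then top up all-incorrect groups from the
    cache of past correct responses."""
    # Allocate budget inversely to the prompt's estimated solve rate.
    budget = int(MIN_ROLLOUTS + (MAX_ROLLOUTS - MIN_ROLLOUTS) * (1.0 - solve_rate))
    responses = generate_responses(prompt, n=budget)
    scored = [(resp, verify(prompt, resp)) for resp in responses]

    # Store fresh correct responses for future reuse.
    for resp, reward in scored:
        if reward > 0:
            correct_cache[prompt].append(resp)

    # If the whole group failed, reuse a cached correct response so the
    # group-relative advantage does not degenerate to zero.
    if all(reward == 0 for _, reward in scored) and correct_cache[prompt]:
        scored.append((random.choice(correct_cache[prompt]), 1.0))
    return scored
```

In this sketch, the reused response injects reward diversity into an otherwise all-zero group, which is what restores a usable group-relative learning signal.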
📝 Abstract
Large language models (LLMs) have achieved impressive reasoning performance, with reinforcement learning with verifiable rewards (RLVR) emerging as a standard paradigm for post-training. A representative algorithm, group relative policy optimization (GRPO) (Shao et al., 2024), computes advantages by normalizing outcome rewards within response groups, but suffers from a vanishing advantage issue when all responses in a group receive identical rewards. To address this issue, we propose Adaptive Rollout and Response Reuse Policy Optimization (AR3PO), a sampling-efficient RLVR algorithm that introduces two novel techniques: adaptive rollout, which dynamically allocates more responses to difficult prompts while saving computation on easier ones, and response reuse, which leverages previously generated correct responses to provide useful training signals. We compare AR3PO with strong RLVR baselines on multiple representative benchmarks using two different families of base models. Across the 7B and 8B models, AR3PO consistently outperforms GRPO and matches or surpasses DAPO (Yu et al., 2025), reducing rollout cost by up to 4.2×. On the larger 32B model, AR3PO achieves comparable performance to DAPO at similar training steps while maintaining substantially lower rollout cost.
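The vanishing advantage issue follows directly from group-wise normalization. Below is a minimal sketch of GRPO-style group-relative advantages, assuming binary outcome rewards; the function name and the zero-std handling are our own illustrative choices, not the paper's implementation.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize outcome rewards within a group.
    When every response in the group receives the same reward, the
    standard deviation is zero and all advantages collapse to zero --
    the vanishing advantage issue."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Degenerate group: identical rewards yield no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# A group where all responses fail (or all succeed) contributes nothing:
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
# A mixed group yields nonzero, group-relative advantages:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Adaptive rollout and response reuse both aim to keep groups out of the degenerate branch above, so that fewer sampled responses are wasted on zero-gradient updates.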