Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive computational overhead of rollouts in LLM reinforcement learning (e.g., PPO/GRPO), this work identifies strong temporal consistency in prompt value across training epochs: prompts that are uninformative in one epoch tend to remain uninformative in later ones. Leveraging this observation, the authors propose GRESO, a lightweight, online pre-rollout filtering algorithm that predicts and skips low-value prompts before rollout execution. GRESO requires no additional human annotations or auxiliary models; instead, it uses reward training dynamics to estimate prompt value and self-tunes its filtering threshold online. It integrates seamlessly into GRPO-style training. Evaluated on multiple mathematical reasoning benchmarks, GRESO achieves up to 2.4x rollout speedup and 2.0x end-to-end training speedup with no accuracy degradation. Its core contribution is identifying and exploiting the temporal consistency of prompt value, enabling efficient, adaptive, and fully unsupervised rollout filtering rather than static or supervised filtering strategies.
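To see why some prompts are uninformative under GRPO, recall that GRPO normalizes each rollout's reward against the group's mean and standard deviation. When every rollout for a prompt receives the same reward (all correct or all wrong), every advantage is zero and the prompt contributes no gradient. A minimal sketch (function name is illustrative, not from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: normalize each
    rollout's reward by the group mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Zero reward variance: every advantage is zero, so this
        # prompt is uninformative for the policy update.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

print(grpo_advantages([1, 1, 1, 1]))  # → [0.0, 0.0, 0.0, 0.0]
print(grpo_advantages([1, 0, 1, 0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Skipping such zero-variance prompts before rollout, rather than after, is what saves the computation.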

📝 Abstract
Reinforcement learning, such as PPO and GRPO, has powered recent breakthroughs in LLM reasoning. Scaling rollout to sample more prompts enables models to selectively use higher-quality data for training, which can stabilize RL training and improve model performance. However, this comes at the cost of significant computational overhead. In this paper, we show that a substantial portion of this overhead can be avoided by skipping uninformative prompts before rollout. Our analysis of reward dynamics reveals a strong temporal consistency in prompt value: prompts that are uninformative in one epoch of training are likely to remain uninformative in future epochs. Based on these insights, we propose GRESO (GRPO with Efficient Selective Rollout), an online, lightweight pre-rollout filtering algorithm that predicts and skips uninformative prompts using reward training dynamics. By evaluating GRESO on a broad range of math reasoning benchmarks and models, such as Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and Qwen2.5-Math-7B, we show that GRESO achieves up to 2.4x wall-clock time speedup in rollout and up to 2.0x speedup in total training time without accuracy degradation.
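The abstract's idea of exploiting temporal consistency can be sketched as a simple online filter: track how many consecutive epochs each prompt produced zero-variance rewards, and skip it before rollout with a probability that grows with that streak. This is an illustrative sketch in the spirit of GRESO, not the paper's exact algorithm; all names and the skip schedule are assumptions.

```python
import random
from collections import defaultdict

class SelectiveRolloutFilter:
    """Illustrative pre-rollout filter (not the paper's exact method):
    prompts whose recent rollouts were uninformative (zero reward
    variance) are skipped with a probability that grows with their
    streak of consecutive uninformative epochs."""

    def __init__(self, base_skip=0.5):
        self.base_skip = base_skip
        self.streak = defaultdict(int)  # consecutive uninformative epochs

    def should_rollout(self, prompt_id):
        # Skip probability rises with the uninformative streak;
        # a fresh or recently informative prompt is always rolled out.
        p_skip = 1.0 - (1.0 - self.base_skip) ** self.streak[prompt_id]
        return random.random() >= p_skip

    def update(self, prompt_id, rewards):
        # Record whether the latest rollout group carried any signal.
        if len(set(rewards)) > 1:
            self.streak[prompt_id] = 0  # informative: reset streak
        else:
            self.streak[prompt_id] += 1
```

Because skipping is probabilistic rather than permanent, a prompt that becomes informative again as the model improves still gets periodic chances to re-enter training.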
Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead in RL for LLM reasoning
Selectively skipping uninformative prompts during rollout
Improving training efficiency without accuracy loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective rollout skips uninformative prompts
Lightweight pre-rollout filtering algorithm GRESO
Achieves speedup without accuracy degradation