Train Less, Learn More: Adaptive Efficient Rollout Optimization for Group-Based Reinforcement Learning

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical inefficiency in group-based reinforcement learning methods such as Group Relative Policy Optimization (GRPO): when all rollouts in a group share the same outcome, the group-normalized advantages vanish, leaving no gradient signal and wasting computation. To mitigate this, the authors propose AERO, which steers training away from zero-advantage regions through adaptive control of rollout counts, a selective rejection mechanism, and Bayesian posterior estimation. AERO preserves effective policy optimization while substantially improving training efficiency. Empirical results show that, under an identical rollout budget, AERO reduces total training computation by approximately 48% and per-step runtime by about 45%, while matching or surpassing GRPO on both Pass@8 and Avg@8 evaluation metrics.

📝 Abstract
Reinforcement learning (RL) plays a central role in large language model (LLM) post-training. Among existing approaches, Group Relative Policy Optimization (GRPO) is widely used, especially for RL with verifiable rewards (RLVR) fine-tuning. In GRPO, each query prompts the LLM to generate a group of rollouts with a fixed group size $N$. When all rollouts in a group share the same outcome, either all correct or all incorrect, the group-normalized advantages become zero, yielding no gradient signal and wasting fine-tuning compute. We introduce Adaptive Efficient Rollout Optimization (AERO), an enhancement of GRPO. AERO uses an adaptive rollout strategy, applies selective rejection to strategically prune rollouts, and maintains a Bayesian posterior to prevent zero-advantage dead zones. Across three model configurations (Qwen2.5-Math-1.5B, Qwen2.5-7B, and Qwen2.5-7B-Instruct), AERO improves compute efficiency without sacrificing performance. Under the same total rollout budget, AERO reduces total training compute by about 48% while shortening wall-clock time per step by about 45% on average. Despite the substantial reduction in compute, AERO matches or improves Pass@8 and Avg@8 over GRPO, demonstrating a practical, scalable, and compute-efficient strategy for RL-based LLM alignment.
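The zero-advantage failure mode described in the abstract is easy to see numerically. Below is a minimal sketch (not the paper's code) of GRPO-style group normalization: when the group's outcomes are mixed, advantages are informative, but when all $N$ rollouts share the same outcome, every advantage collapses to zero.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: subtract the group
    mean reward and divide by the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Mixed outcomes produce nonzero advantages (a useful gradient signal):
print(group_advantages([1, 0, 1, 0]))    # roughly [ 1, -1,  1, -1]

# Homogeneous outcomes (all correct or all incorrect) collapse to zero,
# so the group contributes no gradient and its compute is wasted:
print(group_advantages([1, 1, 1, 1]))    # all ~0
print(group_advantages([0, 0, 0, 0]))    # all ~0
```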
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Group Relative Policy Optimization
Zero-Advantage
Compute Efficiency
LLM Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Rollout
Selective Rejection
Bayesian Posterior
Zero-Advantage Mitigation
Compute-Efficient RL
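One way the "Bayesian Posterior" and "Selective Rejection" ideas above could interact is sketched below. This is an illustrative toy, not AERO's actual algorithm: it maintains a Beta posterior over each query's solve rate and rejects queries whose next group of rollouts is almost certain to be homogeneous (and hence zero-advantage). The function names, prior, and threshold are all assumptions for the sake of the example.

```python
def expected_solve_rate(successes, failures, alpha=1.0, beta=1.0):
    """Posterior mean of a query's solve rate under a Beta(alpha, beta)
    prior updated with observed rollout outcomes."""
    return (alpha + successes) / (alpha + beta + successes + failures)

def prob_zero_advantage(p, n):
    """Probability that all n rollouts agree (all correct or all
    incorrect) given solve rate p, which zeroes every group advantage."""
    return p ** n + (1 - p) ** n

def should_reject(successes, failures, n, threshold=0.9):
    """Selective rejection (hypothetical rule): skip queries whose next
    group of n rollouts is near-certain to carry no gradient signal."""
    p = expected_solve_rate(successes, failures)
    return prob_zero_advantage(p, n) > threshold

# A query solved 100/100 times so far is almost surely homogeneous:
print(should_reject(100, 0, n=8))   # True
# A query near a 50% solve rate very likely yields a mixed group:
print(should_reject(4, 4, n=8))     # False
```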