🤖 AI Summary
To address the high computational cost and low data efficiency of on-policy methods (e.g., GRPO) in reinforcement learning for large language models, this paper proposes a replay-based policy optimization framework. Methodologically, it introduces a multi-policy replay buffer that enables off-policy samples to be retrieved and reused, coupled with prompt-level multi-sample policy gradient updates and a group-wise relative advantage normalization mechanism, thereby relaxing the strict on-policy sampling constraint. These innovations significantly increase the information content of each optimization step and improve sample utilization efficiency. Empirically, on seven mathematical reasoning benchmarks, Qwen2.5-Math-1.5B and Qwen3-1.7B achieve absolute improvements of +18.4 and +4.1 points over GRPO, respectively; the number of effective optimization steps increases by 48%, while computational overhead rises by only 15%.
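The two core mechanisms described above, a per-prompt replay buffer for reusing off-policy samples and group-wise relative advantage normalization over the combined sample set, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual implementation; all class and function names here are hypothetical.

```python
import random
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: for each reward r_i in a prompt's
    sample group, advantage_i = (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

class ReplayBuffer:
    """Minimal per-prompt replay buffer (illustrative only): stores past
    (sample, reward) pairs so off-policy samples can be mixed with fresh
    on-policy rollouts for the same prompt."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.store = {}  # prompt -> list of (sample, reward) pairs

    def add(self, prompt, sample, reward):
        buf = self.store.setdefault(prompt, [])
        buf.append((sample, reward))
        if len(buf) > self.capacity:  # drop the oldest entry when full
            buf.pop(0)

    def retrieve(self, prompt, k, rng=random):
        """Return up to k stored off-policy samples for this prompt."""
        buf = self.store.get(prompt, [])
        return rng.sample(buf, k=min(k, len(buf)))

def mixed_group_advantages(on_policy, replayed):
    """Normalize advantages over the union of fresh on-policy rewards and
    replayed off-policy rewards, mirroring the paper's 8 + 8 setting."""
    rewards = [r for _, r in on_policy] + [r for _, r in replayed]
    return group_relative_advantages(rewards)
```

Normalizing over the combined group means each gradient step is driven by a larger, more diverse set of samples per prompt, which is the source of the claimed gain in per-step information content.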
📝 Abstract
Reinforcement learning (RL) is vital for optimizing large language models (LLMs). The recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.