🤖 AI Summary
To address the high computational cost and low data efficiency of on-policy methods (e.g., GRPO) in reinforcement learning for large language models, this paper proposes a replay-based policy optimization framework. Methodologically, it introduces a multi-policy replay buffer that enables off-policy samples to be retrieved and reused, coupled with prompt-level multi-sample policy gradient updates and a group-wise relative advantage normalization mechanism, thereby relaxing the strict on-policy sampling constraint. These innovations significantly increase the information content of each optimization step and improve sample utilization efficiency. Empirically, on seven mathematical reasoning benchmarks, Qwen2.5-Math-1.5B and Qwen3-1.7B achieve absolute improvements of +18.4 and +4.1 points over GRPO, respectively; the number of effective optimization steps increases by 48%, while computational overhead rises by only 15%.
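The two core mechanisms described above, a per-prompt replay buffer for reusing off-policy samples and group-wise relative advantage normalization over the combined sample set, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual implementation; all class and function names here are hypothetical.

```python
import random
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: for each reward r_i in a prompt's
    sample group, advantage_i = (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

class ReplayBuffer:
    """Minimal per-prompt replay buffer (illustrative only): stores past
    (sample, reward) pairs so off-policy samples can be mixed with fresh
    on-policy rollouts for the same prompt."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.store = {}  # prompt -> list of (sample, reward) pairs

    def add(self, prompt, sample, reward):
        buf = self.store.setdefault(prompt, [])
        buf.append((sample, reward))
        if len(buf) > self.capacity:  # drop the oldest entry when full
            buf.pop(0)

    def retrieve(self, prompt, k, rng=random):
        """Return up to k stored off-policy samples for this prompt."""
        buf = self.store.get(prompt, [])
        return rng.sample(buf, k=min(k, len(buf)))

def mixed_group_advantages(on_policy, replayed):
    """Normalize advantages over the union of fresh on-policy rewards and
    replayed off-policy rewards, mirroring the paper's 8 + 8 setting."""
    rewards = [r for _, r in on_policy] + [r for _, r in replayed]
    return group_relative_advantages(rewards)
```

Normalizing over the combined group means each gradient step is driven by a larger, more diverse set of samples per prompt, which is the source of the claimed gain in per-step information content.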
📝 Abstract
Reinforcement learning (RL) is vital for optimizing large language models (LLMs). The recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.