RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address training instability, policy drift from pretrained weights, and the high energy cost of reinforcement learning (RL) fine-tuning for large language models (LLMs), this paper proposes RLEP, a two-stage framework. Its core idea is a **validation-guided experience replay mechanism**: in Stage I, high-quality reasoning trajectories are collected and verified; in Stage II, these verified trajectories are replayed during RL training to suppress unproductive exploration and anchor the policy on successful reasoning paths. At each update, RLEP optimizes on mini-batches that mix newly generated rollouts with replayed successes. Evaluated on Qwen2.5-Math-7B, it converges significantly faster, surpassing baseline peak accuracy with fewer training steps, and attains 39.9%, 22.3%, and 82.2% accuracy on AIME-2024, AIME-2025, and AMC-2023, respectively. These results indicate that validation-guided replay improves both training stability and final task performance.

📝 Abstract
Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present *RLEP* (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
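The collect-then-verify phase described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`collect_verified`, `extract_answer`, `generate_rollout`), the "Answer:" format, and the rollout count `k` are assumptions introduced here for clarity.

```python
import random

def extract_answer(trajectory: str) -> str:
    # Assumes each trajectory ends with "Answer: <value>" (illustrative format).
    return trajectory.rsplit("Answer:", 1)[-1].strip()

def collect_verified(problems, generate_rollout, k: int = 8):
    """Stage I sketch: sample k rollouts per problem, keep only verified ones."""
    replay_buffer = []
    for prompt, reference in problems:
        for _ in range(k):
            traj = generate_rollout(prompt)
            # Verification: only trajectories with the correct final answer
            # are stored for later replay.
            if extract_answer(traj) == reference:
                replay_buffer.append((prompt, traj))
    return replay_buffer

# Toy stand-in for an LLM policy: answers correctly about half the time.
def toy_rollout(prompt: str) -> str:
    ans = "4" if random.random() < 0.5 else "5"
    return f"2 + 2 = ... Answer: {ans}"

buffer = collect_verified([("What is 2+2?", "4")], toy_rollout, k=8)
# Every stored trajectory verifies against the reference answer.
assert all(extract_answer(t) == "4" for _, t in buffer)
```

The key property is that the replay buffer contains only trajectories that passed verification, so replaying from it cannot reinforce incorrect reasoning paths.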
Problem

Research questions and friction points this paper is trying to address.

Stabilize RL training for large language models
Prevent policy drift from pretrained weights
Improve reasoning accuracy with fewer updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase framework with verified trajectories
Replays high-quality examples during training
Blends new rollouts with replayed successes
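The blending step in the bullets above can be sketched as a mini-batch constructor that mixes fresh rollouts with replayed verified successes. This is a minimal sketch under assumptions: the `replay_ratio` parameter and the uniform-sampling strategy are illustrative choices, not values reported in the paper.

```python
import random

def build_minibatch(new_rollouts, replay_buffer, batch_size=8, replay_ratio=0.25):
    """Stage II sketch: fill a fraction of each batch from the replay buffer,
    the rest from newly generated rollouts."""
    n_replay = min(int(batch_size * replay_ratio), len(replay_buffer))
    n_new = batch_size - n_replay
    batch = random.sample(new_rollouts, min(n_new, len(new_rollouts)))
    batch += random.sample(replay_buffer, n_replay)
    random.shuffle(batch)  # avoid a fixed new-vs-replayed ordering in the batch
    return batch

new = [f"rollout_{i}" for i in range(16)]       # fresh on-policy samples
replay = [f"verified_{i}" for i in range(8)]    # Stage I verified successes
batch = build_minibatch(new, replay, batch_size=8, replay_ratio=0.25)
assert len(batch) == 8
assert sum(x.startswith("verified_") for x in batch) == 2
```

Anchoring each update on a slice of known-good trajectories is what keeps the policy from drifting toward fruitless exploration while still learning from new rollouts.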