🤖 AI Summary
To address training instability, policy drift from pretrained weights, and high energy consumption in reinforcement learning (RL) fine-tuning of large language models (LLMs), this paper proposes RLEP, a two-stage framework. Its core innovation is a **validation-guided experience replay mechanism**: in Stage I, high-quality reasoning trajectories are collected and rigorously validated; in Stage II, these verified trajectories are replayed during RL training to suppress unproductive exploration and anchor the policy to promising reasoning paths. At each update, RLEP optimizes on mini-batches that blend newly generated rollouts with the replayed successes. Evaluated on Qwen2.5-Math-7B, it converges significantly faster, reaching baseline peak accuracy with fewer training steps and ultimately surpassing it, attaining 39.9%, 22.3%, and 82.2% accuracy on AIME-2024, AIME-2025, and AMC-2023, respectively. These results demonstrate that validation-guided replay simultaneously enhances training stability and task performance.
📝 Abstract
Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present *RLEP* (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
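The per-step mixing described above — blending newly generated rollouts with replayed verified successes in each mini-batch — could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the function name, the `replay_fraction` parameter, and the trajectory representation are all assumptions for the sake of the example:

```python
import random

def build_mixed_batch(new_rollouts, replay_pool, batch_size, replay_fraction=0.5):
    """Hypothetical sketch of RLEP-style batch construction.

    Blends freshly sampled rollouts with trajectories drawn from a pool of
    previously verified successes, so each policy update sees both.
    """
    # Cap the replayed portion by what the pool actually contains.
    n_replay = min(int(batch_size * replay_fraction), len(replay_pool))
    n_new = batch_size - n_replay

    batch = random.sample(new_rollouts, n_new) + random.sample(replay_pool, n_replay)
    random.shuffle(batch)  # avoid any ordering bias between the two sources
    return batch
```

A trainer would call this once per update step, feeding the mixed batch to its usual policy-gradient loss; the replayed portion keeps verified reasoning paths in the gradient signal even when fresh exploration fails.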