Efficient RL Training for LLMs with Experience Replay

📅 2026-04-09
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work systematically studies experience replay in reinforcement-learning-based post-training of large language models, challenging the prevailing assumption that fresh on-policy data is essential. The authors frame replay buffer design as a trade-off between staleness-induced variance, sample diversity, and the cost of generation, and argue that strict on-policy sampling is suboptimal when generation is expensive. Empirically, a well-designed replay buffer substantially reduces the inference compute spent generating rollouts while maintaining, and in some cases improving, final model performance and preserving policy entropy. The results suggest that reusing stored rollouts is a viable alternative to generating fresh data at every training step.

📝 Abstract
While Experience Replay (the practice of storing rollouts and reusing them multiple times during training) is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we find that a well-designed replay buffer can drastically reduce inference compute without degrading, and in some cases even improving, final model performance, while preserving policy entropy.
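
To make the idea of storing rollouts and reusing them concrete, here is a minimal sketch of a replay buffer that bounds staleness and reuse. It is an illustration only, not the authors' design: the names `Rollout`, `ReplayBuffer`, `max_staleness`, and `max_uses`, and the uniform sampling rule, are assumptions introduced for this example.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    """One stored generation (hypothetical record layout)."""
    prompt: str
    completion: str
    reward: float
    policy_version: int  # training step of the policy that generated it
    uses: int = 0        # how many training batches have included it


class ReplayBuffer:
    """FIFO buffer with a staleness bound and a per-rollout reuse cap.

    Assumption: staleness is measured as the gap between the current
    policy version and the version that produced the rollout.
    """

    def __init__(self, capacity, max_staleness, max_uses, seed=0):
        self.items = deque(maxlen=capacity)  # oldest entries fall off first
        self.max_staleness = max_staleness
        self.max_uses = max_uses
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.items.extend(rollouts)

    def sample(self, batch_size, current_version):
        # Evict rollouts that are too stale or have been reused too often.
        self.items = deque(
            (r for r in self.items
             if current_version - r.policy_version <= self.max_staleness
             and r.uses < self.max_uses),
            maxlen=self.items.maxlen,
        )
        # Sample uniformly without replacement from what remains.
        k = min(batch_size, len(self.items))
        batch = self.rng.sample(list(self.items), k)
        for r in batch:
            r.uses += 1
        return batch
```

In this sketch, `max_uses` is the knob that trades generation compute for staleness: if every rollout is reused up to `max_uses` times, the number of rollouts that must be generated per training sample falls by up to that factor, while `max_staleness` limits how far the stored data can drift from the current policy.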
Problem

Research questions and friction points this paper is trying to address.

Experience Replay
LLM post-training
on-policy data
computational cost
sample efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Experience Replay
LLM Post-training
On-policy Sampling
Replay Buffer
Policy Entropy