🤖 AI Summary
This work addresses mode collapse and loss of diversity in reinforcement learning caused by experience replay. Arguing that historical data should sustain policy diversity rather than merely reinforce accuracy, the authors propose Dynamic Jensen-Shannon Replay (DyJR), a regularization framework built on a time-sensitive dynamic reference distribution derived from recent trajectories. The method combines a FIFO-based adaptive buffer, which retains only temporally proximal samples so the reference stays synchronized with model evolution, with Jensen-Shannon divergence regularization that replaces direct gradient updates on replayed data, thereby preventing diversity collapse. Experimental results show that DyJR significantly outperforms GRPO, RLEP, and Ex-GRPO on mathematical reasoning and Text-to-SQL tasks, achieving substantially higher generation diversity without compromising training efficiency; a Rank-k token probability analysis further reveals reduced over-reliance on Rank-1 tokens.
📝 Abstract
While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.
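The two components described above (a FIFO buffer of recent trajectories and a Jensen-Shannon divergence penalty against a reference distribution built from them) can be illustrated with a minimal sketch. This is not the paper's implementation: the class and parameter names (`DynamicReplayBuffer`, `beta`, `max_size`) and the choice of a simple mean over buffered distributions as the reference are illustrative assumptions.

```python
from collections import deque
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors (symmetric, bounded)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

class DynamicReplayBuffer:
    """FIFO buffer of recent token distributions (hypothetical sketch).

    A bounded deque drops the oldest entries first, so only temporally
    proximal samples contribute to the reference distribution.
    """
    def __init__(self, max_size=8):
        self.buf = deque(maxlen=max_size)  # FIFO eviction via maxlen

    def add(self, dist):
        self.buf.append(np.asarray(dist, dtype=float))

    def reference(self):
        # Dynamic reference: average over the retained recent distributions.
        return np.mean(np.stack(self.buf), axis=0)

def regularized_loss(task_loss, policy_dist, buffer, beta=0.1):
    """Add a JS-divergence penalty toward the recent-history reference.

    The distributional constraint stands in for direct gradient updates
    on replayed samples; beta is an illustrative regularization weight.
    """
    if not buffer.buf:
        return task_loss
    return task_loss + beta * js_divergence(policy_dist, buffer.reference())
```

In this sketch the penalty vanishes when the current policy matches the recent-history average and grows as the policy drifts toward a narrow (e.g. Rank-1 dominated) distribution, which is the mechanism the abstract credits for preserving diversity.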