DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses mode collapse and the loss of generation diversity that experience replay can induce in reinforcement learning. To balance sample efficiency and training stability, the authors propose an approach that preserves policy diversity through a time-sensitive dynamic reference distribution derived from recent historical trajectories, prioritizing diversity maintenance over accuracy optimization alone. The method combines FIFO-based adaptive buffer management and Jensen-Shannon divergence regularization to prevent diversity collapse while keeping the reference distribution synchronized with policy evolution across training iterations, and uses Rank-k token probability analysis to study the resulting training dynamics. Experimental results show that the proposed approach significantly outperforms GRPO, RLEP, and Ex-GRPO on mathematical reasoning and Text-to-SQL tasks, achieving substantially higher generation diversity without compromising training efficiency.

📝 Abstract
While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.
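The abstract's two core components — a time-sensitive FIFO buffer of recent trajectories and a Jensen-Shannon divergence penalty against the reference distribution it induces — can be sketched as below. This is an illustrative sketch, not the paper's implementation: the buffer size, the penalty weight `beta`, and the simple mean used as the reference distribution are all assumptions.

```python
from collections import deque
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence JSD(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), M = (P+Q)/2."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

class FIFOReplayBuffer:
    """Time-sensitive buffer: a bounded deque keeps only the most recent
    trajectory distributions, so the reference tracks the evolving policy."""
    def __init__(self, maxlen=256):
        self.buf = deque(maxlen=maxlen)  # oldest entries evicted first

    def add(self, token_probs):
        self.buf.append(np.asarray(token_probs, dtype=np.float64))

    def reference_distribution(self):
        # Illustrative choice: average stored distributions into one reference.
        return np.mean(np.stack(self.buf), axis=0)

def regularized_loss(task_loss, policy_probs, buffer, beta=0.1):
    """Task loss plus a JSD penalty toward the buffer's dynamic reference,
    replacing direct gradient updates on replayed samples."""
    ref = buffer.reference_distribution()
    return task_loss + beta * js_divergence(policy_probs, ref)
```

Because the buffer is FIFO and bounded, the reference distribution reflects only temporally proximal rollouts, so the JSD term constrains the policy toward its own recent behavior rather than toward stale, high-accuracy samples.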
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Experience Replay
Mode Collapse
Diversity Preservation
Sample Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Jensen-Shannon Replay
Experience Replay
Diversity Preservation
Jensen-Shannon Divergence Regularization
Time-Sensitive Buffer