Prioritized Replay for RL Post-training

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of data selection in reinforcement learning–based post-training of large language models, and the reliance of existing pipelines on manually designed curricula, by proposing a problem-level adaptive prioritization framework. The method dynamically estimates problem difficulty from empirical success rates, focusing training on “medium-difficulty” samples without requiring predefined difficulty labels or auxiliary predictors. It employs heap-based priority sampling, a success-rate–driven scoring mechanism, and a lightweight periodic re-evaluation strategy to mitigate training starvation and catastrophic forgetting. Integrated into the GRPO post-training framework, the approach substantially improves the quality of learning signals, yielding an efficient, scalable, and fully performance-driven automatic data-selection mechanism.
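A minimal sketch of how heap-based priority sampling over success-rate statistics might look. The class name, the triangular score that peaks at a 0.5 empirical success rate, and the lazy stale-entry handling are illustrative assumptions, not the paper's exact implementation:

```python
import heapq
import random

class PrioritySampler:
    """Sketch of heap-based prioritized problem sampling.

    Priority peaks at an empirical success rate of 0.5 and falls off toward
    0.0 (always failed) and 1.0 (always solved), so problems of intermediate
    difficulty are drawn first. Unseen problems get a neutral 0.5 prior.
    """

    def __init__(self, problem_ids):
        self.stats = {pid: (0, 0) for pid in problem_ids}  # (successes, attempts)
        self.version = {pid: 0 for pid in problem_ids}     # invalidates stale heap entries
        self.heap = []
        for pid in problem_ids:
            heapq.heappush(self.heap, (self._priority(pid), 0, random.random(), pid))

    def _priority(self, pid):
        successes, attempts = self.stats[pid]
        rate = successes / attempts if attempts else 0.5
        # Highest when rate is near 0.5; heapq is a min-heap, so negate.
        return -(1.0 - 2.0 * abs(rate - 0.5))

    def sample(self):
        # Pop the highest-priority problem, skipping entries made stale
        # by a later update() (lazy deletion).
        while self.heap:
            _, ver, _, pid = heapq.heappop(self.heap)
            if ver == self.version[pid]:
                return pid
        raise IndexError("no problems available; call update() to re-insert")

    def update(self, pid, num_success, num_rollouts):
        # Fold new rollout outcomes into the running success statistics and
        # re-insert the problem with its refreshed priority.
        s, a = self.stats[pid]
        self.stats[pid] = (s + num_success, a + num_rollouts)
        self.version[pid] += 1
        heapq.heappush(self.heap, (self._priority(pid), self.version[pid],
                                   random.random(), pid))
```

A training loop would `sample()` a problem, run GRPO rollouts on it, then `update()` with the observed successes, so the schedule continuously adapts as the model improves.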

📝 Abstract
We introduce a problem-level prioritization framework for RL post-training of large language models. Building on insights from prioritized replay in deep RL, as well as prior observations that rollouts with intermediate success rates tend to produce stronger learning signals under methods such as GRPO, our approach selects problems according to a simple, model-driven priority score derived from empirical success statistics. In contrast to conventional curriculum strategies that emphasize easier tasks early in training, the resulting schedule naturally focuses training on problems that are neither consistently solved nor consistently failed, while deprioritizing those that contribute little gradient information. The method yields a continuously adapting and automatic prioritization process that requires no predefined difficulty tiers, auxiliary predictors, or external labels. We further introduce lightweight mechanisms for practical deployment, including heap-based prioritized sampling and periodic retesting of solved and unsolved problems to mitigate starvation and forgetting. Overall, the approach offers a principled and scalable alternative to manually designed curricula while aligning data selection directly with the dynamics of GRPO-based post-training.
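The periodic retesting the abstract mentions can be sketched as flagging problems whose empirical success rate has drifted to an extreme, then re-running rollouts on them so consistently failed problems are not starved and consistently solved ones are checked for forgetting. The `SuccessStats` tracker, `retest_candidates` helper, and the 0.1/0.9 thresholds below are hypothetical choices for illustration:

```python
class SuccessStats:
    """Minimal per-problem success-rate tracker (a plain dict stands in
    for the heap-based sampler here)."""

    def __init__(self, problem_ids):
        self.stats = {pid: [0, 0] for pid in problem_ids}  # [successes, attempts]

    def rate(self, pid):
        s, a = self.stats[pid]
        return s / a if a else 0.5  # neutral prior for untested problems

    def update(self, pid, successes, attempts):
        self.stats[pid][0] += successes
        self.stats[pid][1] += attempts

def retest_candidates(tracker, low=0.1, high=0.9):
    """Problems whose success rate is extreme: consistently failed
    (starvation risk) or consistently solved (forgetting risk). These
    would be re-evaluated with fresh rollouts every few hundred steps."""
    return [pid for pid in tracker.stats
            if tracker.rate(pid) <= low or tracker.rate(pid) >= high]
```

Re-running rollouts on the returned problems and feeding the outcomes back through `update()` keeps the success-rate estimates, and hence the priorities, aligned with the current policy rather than a stale snapshot.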
Problem

Research questions and friction points this paper is trying to address.

prioritized replay
RL post-training
curriculum learning
large language models
problem prioritization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritized Replay
RL Post-training
Curriculum Learning
GRPO
Automatic Prioritization