🤖 AI Summary
During RL-based LLM training, many prompts become “residual prompts” due to zero reward variance: they yield no gradient signal, which reduces the effective sample count and constrains policy exploration. To address this, we propose ERPO, a novel framework that systematically leverages residual-prompt data for the first time. ERPO identifies residual prompts via historical reward tracking and introduces an adaptive temperature adjustment mechanism to enable targeted exploration over them while preserving policy stability. Crucially, ERPO requires no additional annotations or external reward validators and integrates seamlessly into mainstream RLVR algorithms (e.g., GRPO) as a plug-and-play module. Experiments on the Qwen2.5 series demonstrate that ERPO significantly outperforms strong baselines on mathematical reasoning benchmarks (e.g., GSM8K, MATH), achieving an average +3.2% absolute improvement, while also enhancing sample efficiency and policy diversity.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts: prompts whose sampled rewards have zero variance and therefore provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all-correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive the training signal. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
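The history-tracking and temperature-adjustment mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the temperature hyperparameters (`base_temp`, `temp_step`, `max_temp`), and the binary 0/1 reward convention are all assumptions for the sake of the example.

```python
class ResidualPromptTracker:
    """Sketch of ERPO-style per-prompt temperature adaptation (hypothetical API).

    A prompt is "residual" when all responses in its sampled group receive the
    same reward (zero variance), so the group-relative advantage is zero and the
    prompt contributes no gradient. For residual prompts whose responses were all
    correct, the sampling temperature is raised to encourage diverse reasoning
    traces; otherwise it resets to the base temperature.
    """

    def __init__(self, base_temp=1.0, temp_step=0.2, max_temp=1.6):
        self.base_temp = base_temp  # default sampling temperature
        self.temp_step = temp_step  # increment applied per residual round (assumed)
        self.max_temp = max_temp    # cap to preserve policy stability (assumed)
        self.temps = {}             # prompt_id -> current sampling temperature

    def update(self, prompt_id, rewards):
        """Record one group of verifiable rewards (e.g., 1.0 correct, 0.0 not)."""
        zero_variance = len(set(rewards)) == 1
        all_correct = zero_variance and rewards[0] == 1.0
        current = self.temps.get(prompt_id, self.base_temp)
        if all_correct:
            # Residual, all-correct prompt: nudge temperature up to revive signal.
            self.temps[prompt_id] = min(self.max_temp, current + self.temp_step)
        else:
            # Mixed or all-wrong rewards already carry a gradient; reset.
            self.temps[prompt_id] = self.base_temp

    def sampling_temperature(self, prompt_id):
        return self.temps.get(prompt_id, self.base_temp)
```

In use, the trainer would call `update` after scoring each rollout group and query `sampling_temperature` before the next generation pass for that prompt; prompts that repeatedly yield all-correct groups are sampled at progressively higher temperatures until an incorrect response reintroduces reward variance.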