🤖 AI Summary
To address the inefficiency and instability of on-policy training in Reinforcement Learning with Verifiable Rewards (RLVR), where rollout experiences are discarded after a single update, this paper proposes ExGRPO (Experiential Group Relative Policy Optimization), an experience-reuse framework for improving the reasoning ability of large language models (LLMs). The method introduces a dual-metric notion of "reasoning experience value", combining rollout correctness and entropy, to guide how experiences are organized and prioritized. It then trains with a mixed-policy objective that balances on-policy exploration with the exploitation of replayed experiences, building on Group Relative Policy Optimization (GRPO) with verifiable reward signals. Experiments on five backbone LLMs (1.5B-8B parameters) show consistent gains over on-policy RLVR: +3.5 points on mathematical reasoning and +7.6 points on general reasoning benchmarks. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail, establishing principled experience management as a more sample-efficient and robust paradigm for RLVR-based reasoning optimization.
📝 Abstract
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
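The core idea, prioritizing replayed rollouts by a correctness-and-entropy notion of experience value, can be illustrated with a minimal sketch. Everything here is hypothetical: the `Rollout` schema, the `experience_value` heuristic (which favors correct rollouts whose entropy is moderate), and the buffer's weighted sampling are illustrative stand-ins, not the paper's actual implementation.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    """One stored reasoning trajectory (hypothetical schema)."""
    prompt: str
    response: str
    correct: bool          # verifiable-reward outcome: did the answer check out?
    token_entropy: float   # mean per-token entropy of the generating policy

def experience_value(r: Rollout, target_entropy: float = 1.0) -> float:
    """Toy scoring heuristic: only correct rollouts carry value, and value
    decays as entropy moves away from a moderate target (neither trivially
    confident nor near-random)."""
    correctness = 1.0 if r.correct else 0.0
    entropy_fit = math.exp(-abs(r.token_entropy - target_entropy))
    return correctness * entropy_fit

class ExperienceBuffer:
    """Minimal replay buffer that samples past rollouts in proportion
    to their estimated experience value."""
    def __init__(self) -> None:
        self.store: list[Rollout] = []

    def add(self, rollout: Rollout) -> None:
        self.store.append(rollout)

    def sample(self, k: int) -> list[Rollout]:
        weights = [experience_value(r) for r in self.store]
        if sum(weights) == 0:  # all zero-value: fall back to uniform sampling
            return random.sample(self.store, min(k, len(self.store)))
        return random.choices(self.store, weights=weights, k=k)
```

In a mixed-policy training loop, batches sampled this way would be combined with fresh on-policy rollouts, so exploration continues while high-value past experience is reused rather than discarded.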