AI Summary
While Reinforcement Learning with Verifiable Reward (RLVR) enhances the complex reasoning capabilities of large language models (LLMs), it remains constrained by the base model's inherent capability limits and often triggers capability boundary collapse. This paper proposes RL-PLUS, a novel framework that jointly leverages internal reasoning traces and external data to mitigate policy degeneracy and reward sparsity. RL-PLUS introduces multiple importance sampling to correct the policy distribution shift and designs an exploration-aware advantage function to guide the model toward high-value reasoning paths. Furthermore, it integrates offline policy optimization to enable synergistic utilization of internal and external knowledge. Evaluated on six mathematical reasoning benchmarks, RL-PLUS achieves state-of-the-art performance, delivering average relative improvements of 21.1% to 69.2% across diverse model families. Pass@k analysis confirms substantial alleviation of capability boundary collapse.
Abstract
Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its inherently on-policy strategy combined with the LLM's immense action space and sparse rewards. Further, RLVR can lead to capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel approach that synergizes internal exploitation (i.e., Thinking) with external data (i.e., Learning) to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling, which addresses the distributional mismatch introduced by external data, and an Exploration-Based Advantage Function, which guides the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. The results show that RL-PLUS achieves state-of-the-art performance compared with existing RLVR methods on six math reasoning benchmarks and exhibits superior performance on six out-of-distribution reasoning tasks. It also achieves consistent and significant gains across diverse model families, with average relative improvements ranging from 21.1% to 69.2%. Moreover, Pass@k curves across multiple benchmarks indicate that RL-PLUS effectively resolves the capability boundary collapse problem.
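The abstract does not spell out the Multiple Importance Sampling formulation, but the general idea it names can be illustrated on a toy problem. The sketch below (all distributions and the reward function are assumptions for illustration, not the paper's token-level setup) combines samples from a "current policy" and a mismatched "external data" distribution via the standard balance heuristic, yielding an unbiased estimate of an expectation under the target policy despite the distribution shift:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's formulation): the target
# policy pi and an external "behavior" distribution q are 1-D Gaussians.
def pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_pi = lambda x: pdf(x, 0.0, 1.0)   # current policy density
p_q  = lambda x: pdf(x, 2.0, 1.5)   # external-data density (mismatched)
f = lambda x: x ** 2                # stand-in reward; E_pi[f] = 1.0

n_pi = n_q = 5000
x_pi = rng.normal(0.0, 1.0, n_pi)   # on-policy samples
x_q  = rng.normal(2.0, 1.5, n_q)    # external samples

# Balance-heuristic multiple importance sampling: each sample is weighted
# by the target density over the count-weighted mixture of all proposal
# densities, so both sample sources contribute without biasing the estimate.
x = np.concatenate([x_pi, x_q])
mixture = n_pi * p_pi(x) + n_q * p_q(x)
est = np.sum(p_pi(x) * f(x) / mixture)
print(est)  # close to the true value 1.0
```

Plain single-distribution importance sampling with ratio `p_pi / p_q` would blow up in regions where the external density is tiny; the mixture denominator keeps the weights bounded, which is the usual motivation for multiple importance sampling when mixing on-policy and off-policy data.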