AI Summary
While Reinforcement Learning with Verifiable Reward (RLVR) enhances the complex reasoning capabilities of large language models (LLMs), it remains constrained by the base model's inherent capability limits and often triggers capability boundary collapse. This paper proposes RL-PLUS, a novel framework that jointly leverages internal reasoning traces and external data to mitigate policy degeneracy and reward sparsity. RL-PLUS introduces multiple importance sampling to correct the policy distribution shift and designs an exploration-aware advantage function to guide the model toward high-value reasoning paths. Furthermore, it integrates offline policy optimization to enable synergistic utilization of internal and external knowledge. Evaluated on six mathematical reasoning benchmarks, RL-PLUS achieves state-of-the-art performance, delivering average relative improvements of 21.1% to 69.2% across diverse model families. Pass@k analysis confirms substantial alleviation of capability boundary collapse.
Abstract
Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its inherently on-policy strategy combined with the LLM's immense action space and sparse rewards. Further, RLVR can lead to capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel approach that synergizes internal exploitation (i.e., Thinking) with external data (i.e., Learning) to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling, which addresses the distributional mismatch introduced by external data, and an Exploration-Based Advantage Function, which guides the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. The results show that RL-PLUS achieves state-of-the-art performance compared with existing RLVR methods on six math reasoning benchmarks and exhibits superior performance on six out-of-distribution reasoning tasks. It also achieves consistent and significant gains across diverse model families, with average relative improvements ranging from 21.1% to 69.2%. Moreover, Pass@k curves across multiple benchmarks indicate that RL-PLUS effectively resolves the capability boundary collapse problem.
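The abstract does not spell out the Multiple Importance Sampling formulation, but the general idea it names can be illustrated on a toy problem. The sketch below (all distributions and the reward function are assumptions for illustration, not the paper's token-level setup) combines samples from a "current policy" and a mismatched "external data" distribution via the standard balance heuristic, yielding an unbiased estimate of an expectation under the target policy despite the distribution shift:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's formulation): the target
# policy pi and an external "behavior" distribution q are 1-D Gaussians.
def pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_pi = lambda x: pdf(x, 0.0, 1.0)   # current policy density
p_q  = lambda x: pdf(x, 2.0, 1.5)   # external-data density (mismatched)
f = lambda x: x ** 2                # stand-in reward; E_pi[f] = 1.0

n_pi = n_q = 5000
x_pi = rng.normal(0.0, 1.0, n_pi)   # on-policy samples
x_q  = rng.normal(2.0, 1.5, n_q)    # external samples

# Balance-heuristic multiple importance sampling: each sample is weighted
# by the target density over the count-weighted mixture of all proposal
# densities, so both sample sources contribute without biasing the estimate.
x = np.concatenate([x_pi, x_q])
mixture = n_pi * p_pi(x) + n_q * p_q(x)
est = np.sum(p_pi(x) * f(x) / mixture)
print(est)  # close to the true value 1.0
```

Plain single-distribution importance sampling with ratio `p_pi / p_q` would blow up in regions where the external density is tiny; the mixture denominator keeps the weights bounded, which is the usual motivation for multiple importance sampling when mixing on-policy and off-policy data.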