Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
During RLVR (Reinforcement Learning with Verifiable Rewards) training, the advantage of large language models (LLMs) over their base models in mathematical reasoning diminishes as the sampling budget increases, revealing a fundamental bottleneck rooted in the base model's constrained search space. Method: The authors propose RAPO (Rewards-Aware Policy Optimization), an algorithm that (i) replaces the conventional reverse KL regularizer with a forward KL divergence penalty to permit out-of-distribution exploration, and (ii) dynamically reweights the reference policy to enable adaptive in-distribution exploration, thereby overcoming the restrictive, mode-seeking exploration bias of reverse KL regularization. Contribution/Results: RAPO trains Qwen2.5-3B and 7B models with verifiable rewards alone, without supervised fine-tuning. On AIME 2024 and AIME 2025, the trained models surpass the base models' performance ceiling and solve previously intractable problems, empirically validating that improved exploration mechanisms enhance both the effectiveness and generalizability of LLMs' mathematical reasoning.
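
The asymmetry between the two regularizers can be illustrated with a toy sketch (the distributions and function names below are illustrative, not from the paper). Reverse KL, D(π‖π_ref), blows up when the policy places mass where the reference has almost none, so it discourages leaving the base model's support; forward KL, D(π_ref‖π), barely penalizes such off-support mass:

```python
import math

def reverse_kl(p_policy, p_ref):
    """Reverse KL D(pi || pi_ref): mode-seeking. The p*log(p/q) term
    explodes when the policy puts mass (p) where the reference (q) is
    near zero, trapping exploration inside the base model's support."""
    return sum(p * math.log(p / q) for p, q in zip(p_policy, p_ref) if p > 0)

def forward_kl(p_policy, p_ref):
    """Forward KL D(pi_ref || pi): mass-covering. Terms are weighted by
    the reference (q), so policy mass outside the reference's support
    contributes little, tolerating out-of-distribution exploration."""
    return sum(q * math.log(q / p) for p, q in zip(p_policy, p_ref) if q > 0)

# Toy 3-token distributions: the policy puts 30% of its mass on a token
# the reference almost never emits (1%).
pi = [0.6, 0.1, 0.3]
pi_ref = [0.7, 0.29, 0.01]

print(reverse_kl(pi, pi_ref))  # large penalty for off-support mass
print(forward_kl(pi, pi_ref))  # much smaller penalty for the same policy
```

Under reverse KL this exploratory policy is penalized roughly twice as hard as under forward KL, which is the bias RAPO's first component removes.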

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning via broader exploration in reinforcement learning
Overcoming base model limitations in mathematical problem solving tasks
Addressing vanishing RLVR advantages with improved policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses forward KL penalty for broader exploration
Reweights reference policy for adaptive exploration
Trains models without supervised fine-tuning
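
One plausible reading of the reference-policy reweighting idea (a hypothetical sketch only; the paper's exact update rule, the `beta` parameter, and the exponential weighting are assumptions) is to shift the KL anchor itself toward base-model candidates that earned verifiable reward, then renormalize:

```python
import math

def reweight_reference(p_ref, rewards, beta=1.0):
    """Hypothetical sketch of adaptive in-distribution exploration:
    upweight reference-policy probabilities on candidates with verifiable
    reward (exponential tilting, an assumed form) and renormalize, so the
    regularizer's anchor drifts toward high-reward regions of the base
    model's own distribution."""
    weights = [p * math.exp(beta * r) for p, r in zip(p_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

p_ref = [0.5, 0.3, 0.2]       # base model's preferences over 3 candidates
rewards = [0.0, 1.0, 0.0]     # only the second candidate verified correct
print(reweight_reference(p_ref, rewards))  # mass shifts toward candidate 2
```

The design intuition: rather than anchoring the policy to a frozen base distribution, the anchor adapts as verifiable rewards arrive, focusing in-distribution exploration on promising modes.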