DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning with verifiable rewards (RLVR) methods for training LLM mathematical reasoning suffer from premature convergence due to sparse exploration, which frequently misses critical reasoning paths. To address this, we propose DeepSearch, a framework that integrates Monte Carlo Tree Search (MCTS) directly into the RLVR training loop. The method introduces: (i) a global frontier node selection strategy for systematic path exploration; (ii) an entropy-based path filtering mechanism that identifies confident paths for supervision; and (iii) dynamic solution caching coupled with adaptive replay for training efficiency and fine-grained credit assignment. By prioritizing algorithmic innovation over computational scaling, DeepSearch achieves a new state-of-the-art average accuracy of 62.95% on mathematical reasoning benchmarks with a 1.5B-parameter model, surpassing prior RLVR approaches while using 5.7x fewer GPU hours than extended RLVR training.
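The global frontier node selection described above can be sketched as a single priority queue over every unexpanded leaf in the search tree, so selection picks the most promising node anywhere in the tree rather than descending one root-to-leaf path. This is a minimal sketch with a UCT-style score; the class and function names are illustrative, and the paper's exact scoring rule may differ.

```python
import heapq
import math
from dataclasses import dataclass, field


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """UCT-style score: exploitation term plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try unvisited nodes first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)


@dataclass(order=True)
class FrontierNode:
    neg_score: float                      # heapq is a min-heap, so store -score
    depth: int = field(compare=False)
    prefix: str = field(compare=False)    # partial reasoning path to this node


class GlobalFrontier:
    """One heap over all frontier nodes of the tree (global, not per-branch)."""

    def __init__(self):
        self._heap = []

    def push(self, score, depth, prefix):
        heapq.heappush(self._heap, FrontierNode(-score, depth, prefix))

    def pop_best(self):
        node = heapq.heappop(self._heap)
        return -node.neg_score, node.prefix
```

In use, every expansion pushes the new children onto the shared heap, and the next simulation starts from `pop_best()` regardless of which subtree that node lives in.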

📝 Abstract
Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models, using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
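Contribution (3), adaptive replay with solution caching, can be sketched as a cache keyed by problem that stores verified solution paths, so later training steps can replay a known-correct path instead of repeating an expensive tree search. The names and the `replay_prob` knob below are illustrative assumptions, not the paper's API.

```python
import random


class SolutionCache:
    """Stores verified (correct) solution paths per problem for replay."""

    def __init__(self):
        self._cache = {}  # problem id -> list of verified solution paths

    def add(self, problem_id, path, reward):
        if reward > 0:  # cache only solutions the verifier accepted
            self._cache.setdefault(problem_id, []).append(path)

    def has(self, problem_id):
        return bool(self._cache.get(problem_id))

    def sample(self, problem_id):
        return random.choice(self._cache[problem_id])


def next_batch(problems, cache, search_fn, replay_prob=0.5):
    """Adaptive replay: for cached problems, sometimes replay a stored
    solution instead of launching a fresh tree search."""
    batch = []
    for pid in problems:
        if cache.has(pid) and random.random() < replay_prob:
            batch.append((pid, cache.sample(pid)))   # cheap: replay
        else:
            batch.append((pid, search_fn(pid)))      # expensive: search
    return batch
```

The efficiency gain comes from amortizing search cost: once a problem has a verified solution, its supervision signal can be reused at replay time at negligible cost.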
Problem

Research questions and friction points this paper is trying to address.

Addresses training plateaus in RLVR with sparse exploration patterns
Overcomes insufficient exploration in reasoning paths during RL training
Mitigates diminishing performance gains despite increased computational investment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Monte Carlo Tree Search into RLVR training
Uses global frontier selection for promising nodes
Employs entropy-based guidance for confident path identification
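The entropy-based guidance above can be illustrated as filtering candidate reasoning paths by their mean token entropy, keeping only the low-entropy paths the model is confident about as supervision targets. This is a sketch under assumed definitions (mean Shannon entropy in nats, a fixed threshold); the paper's exact criterion may differ.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def path_entropy(per_token_probs):
    """Mean token entropy along a reasoning path; lower = more confident."""
    entropies = [token_entropy(p) for p in per_token_probs]
    return sum(entropies) / len(entropies)


def filter_confident_paths(paths, threshold):
    """Keep only paths whose mean entropy falls below the threshold.

    `paths` is a list of (path_text, per_token_probs) pairs.
    """
    return [text for text, dist in paths if path_entropy(dist) <= threshold]
```

A near-deterministic next-token distribution like `[0.99, 0.01]` has entropy around 0.06 nats, while a uniform `[0.5, 0.5]` sits at ln 2 (about 0.69), so a modest threshold cleanly separates confident paths from uncertain ones.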