Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two key bottlenecks in applying large language models (LLMs) to automated theorem proving, namely the limited scalability of reinforcement learning (RL) at training time and insufficient search depth at inference time, this paper proposes a dual-path optimization framework. First, it introduces a multi-turn off-policy RL method built on multi-stage expert iteration, combining AlphaZero-style policy/value updates, adaptive tactic-level data filtering, and periodic retraining. Second, it designs a planner-driven, hierarchical multi-agent tree search architecture that enables parallel exploration over a shared proof cache. Together, these components jointly improve training efficiency and inference-time search depth. Evaluated on the MiniF2F and ProofNet test sets, the system attains proof success rates of 95.08% and 41.4%, respectively, substantially surpassing prior state-of-the-art methods.

📝 Abstract
The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of an LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
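The expert-iteration pipeline the abstract describes (search, then filter successful tactic steps, then retrain) can be illustrated with a toy loop. Everything below is a hedged sketch under strong simplifying assumptions, not the paper's implementation: the "prover" samples named tactics from a weighted policy, "retraining" is a simple weight bump, and the `easy_threshold` filtering rule is an invented stand-in for the paper's adaptive tactic-level data filtering.

```python
import random

random.seed(0)  # deterministic toy run

def attempt_proof(policy, theorem, budget=8):
    """Toy search: sample tactics from the current policy; the 'proof'
    succeeds when the theorem's solving tactic is drawn."""
    for _ in range(budget):
        tactic = random.choices(list(policy), weights=list(policy.values()))[0]
        if tactic == theorem["solution"]:
            return [(theorem["goal"], tactic)]
    return None  # search budget exhausted

def filter_tactics(policy, steps, easy_threshold=0.5):
    """Sketch of adaptive tactic-level filtering: drop steps the current
    policy already rates as very likely, keeping only informative data."""
    total = sum(policy.values())
    return [(g, t) for g, t in steps if policy[t] / total < easy_threshold]

def expert_iteration(policy, theorems, stages=3):
    """Multi-stage loop: search -> filter -> 'retrain' (re-weight)."""
    for _ in range(stages):
        data = []
        for thm in theorems:
            steps = attempt_proof(policy, thm)
            if steps:
                data.extend(filter_tactics(policy, steps))
        for _, tactic in data:
            policy[tactic] += 1.0  # stand-in for a gradient update
    return policy

policy = {"ring": 1.0, "simp": 1.0, "linarith": 1.0}
theorems = [{"goal": "x + 0 = x", "solution": "simp"},
            {"goal": "a < a + 1", "solution": "linarith"}]
policy = expert_iteration(policy, theorems)
```

Note how the filtering threshold makes the data distribution self-correcting in this sketch: once a tactic dominates the policy, its successes stop contributing training data, which is one plausible reading of how filtering helps avoid the performance plateaus the abstract mentions.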
Problem

Research questions and friction points this paper is trying to address.

Scaling training-time reinforcement learning for LLM theorem provers
Enhancing inference-time compute with multi-agent search architecture
Overcoming performance plateaus in long-term RL for automated reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn off-policy RL framework with adaptive filtering
Planner-enhanced multi-agent search architecture
Hierarchical theorem decomposition with shared proof cache
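The three innovations above compose naturally: a planner decomposes a theorem into subgoals, parallel prover agents work on them, and a shared cache avoids re-proving repeated subgoals. The sketch below is purely illustrative and assumes nothing from the paper beyond that outline: `plan` is a stand-in for the high-level reasoning model (here it just splits on "and"), and `prover_agent` fabricates a placeholder proof instead of running tactic search.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SharedProofCache:
    """Shared proof cache (sketch): subgoal -> proof, guarded by a lock
    so parallel agents can read and write it safely."""
    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, goal):
        with self._lock:
            return self._cache.get(goal)

    def put(self, goal, proof):
        with self._lock:
            self._cache[goal] = proof

def plan(theorem):
    """Stand-in for the high-level planner model: decompose a compound
    statement into simpler subgoals (illustrative only)."""
    return theorem.split(" and ")

def prover_agent(subgoal, cache):
    """One prover agent: reuse a cached proof if available, otherwise
    'prove' the subgoal (placeholder for step-level tactic search)."""
    proof = cache.get(subgoal)
    if proof is None:
        proof = f"proof_of({subgoal})"
        cache.put(subgoal, proof)
    return proof

def hierarchical_search(theorem, cache, workers=4):
    """Planner-driven search: plan subgoals, prove them in parallel."""
    subgoals = plan(theorem)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        proofs = list(pool.map(lambda g: prover_agent(g, cache), subgoals))
    return dict(zip(subgoals, proofs))

cache = SharedProofCache()
result = hierarchical_search("x + 0 = x and a < a + 1 and x + 0 = x", cache)
```

In this toy run the duplicated subgoal `x + 0 = x` is proved once and served from the cache the second time, which mirrors the efficiency argument behind the shared proof cache.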