🤖 AI Summary
Large language models (LLMs) face significant challenges in automated theorem proving (ATP), including sparse rewards, long proof paths, and difficulty with multi-step logical reasoning. Method: This paper proposes the self-generated goal-conditioned Markov decision process (sG-MDP) framework, in which agents dynamically generate subgoals conditioned on the current proof state. A modular architecture coordinates multiple 7B-parameter LLMs, with distinct models handling subgoal generation and tactic synthesis. Proof search employs a Monte Carlo Tree Search (MCTS)-inspired algorithm augmented with a multi-model ensemble to strengthen logical structure modeling. Contribution/Results: Evaluated on PutnamBench, the method solves 26 university-level problems, the best result reported for models at the 7B scale. It establishes the first end-to-end ATP paradigm grounded in autonomous subgoal generation, markedly improving both interpretability and search efficiency in complex mathematical reasoning.
📝 Abstract
Reasoning remains a challenging task for large language models (LLMs), especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their own subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.
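To make the abstract's pipeline concrete, here is a minimal, self-contained sketch of an MCTS-like search guided by self-generated subgoals. Everything in it is a hypothetical stand-in: the "proof state" is just an integer to be transformed into a target, `propose_subgoal` and `tactics` replace the LLM subgoal and tactic models, and no proof assistant is involved. It only illustrates the control flow (select, expand, subgoal-guided rollout, backpropagate), not the actual Bourbaki (7B) system.

```python
import math
import random

TARGET = 24  # toy stand-in for "the theorem is proved"

def propose_subgoal(state):
    # Hypothetical subgoal model: aim for a value between here and the target.
    return (state + TARGET) // 2 if state != TARGET else TARGET

def tactics(state):
    # Hypothetical tactic model: candidate successor states.
    return [state + 1, state + 3, state * 2]

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    # Upper confidence bound: unvisited children are tried first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, iters=500, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iters):
        # Selection: descend via UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: one child per candidate tactic, once the leaf is visited.
        if node.visits > 0 and node.state != TARGET:
            node.children = [Node(s, node) for s in tactics(node.state)]
            node = rng.choice(node.children)
        # Rollout: greedy walk toward the self-generated subgoal.
        state, reward = node.state, 0.0
        for _ in range(10):
            if state == TARGET:
                reward = 1.0
                break
            subgoal = propose_subgoal(state)
            state = min(tactics(state), key=lambda s: abs(s - subgoal))
        # Backpropagation: credit the whole path for reaching the target.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first move from the root.
    return max(root.children, key=lambda n: n.visits).state
```

In a real instantiation, the rollout step would submit a tactic to the proof assistant and the subgoal generator would condition on the resulting proof state; here the subgoal merely biases the rollout, which is what makes the sparse terminal reward reachable within a short horizon.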