🤖 AI Summary
Large language models (LLMs) face significant challenges in automated theorem proving (ATP), including sparse rewards, long proof paths, and difficulty with multi-step logical reasoning. Method: This paper proposes the self-generated goal-conditioned Markov decision process (sG-MDP) framework, in which agents dynamically generate subgoals conditioned on the current proof state. A modular architecture coordinates multiple 7B-parameter LLMs, with distinct models handling subgoal generation and tactic synthesis. Proof search employs a Monte Carlo Tree Search (MCTS)-inspired algorithm augmented with a multi-model ensemble to strengthen logical structure modeling. Contribution/Results: Evaluated on PutnamBench, the method solves 26 university-level problems, the best result reported for models at the 7B scale. It establishes the first end-to-end ATP paradigm grounded in autonomous subgoal generation, markedly improving both interpretability and search efficiency in complex mathematical reasoning.
📝 Abstract
Reasoning remains a challenging task for large language models (LLMs), especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their own subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.
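To make the abstract's pipeline concrete, here is a minimal, self-contained sketch of an MCTS-like search guided by self-generated subgoals. Everything in it is a hypothetical stand-in: the "proof state" is just an integer to be transformed into a target, `propose_subgoal` and `tactics` replace the LLM subgoal and tactic models, and no proof assistant is involved. It only illustrates the control flow (select, expand, subgoal-guided rollout, backpropagate), not the actual Bourbaki (7B) system.

```python
import math
import random

TARGET = 24  # toy stand-in for "the theorem is proved"

def propose_subgoal(state):
    # Hypothetical subgoal model: aim for a value between here and the target.
    return (state + TARGET) // 2 if state != TARGET else TARGET

def tactics(state):
    # Hypothetical tactic model: candidate successor states.
    return [state + 1, state + 3, state * 2]

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    # Upper confidence bound: unvisited children are tried first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, iters=500, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iters):
        # Selection: descend via UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: one child per candidate tactic, once the leaf is visited.
        if node.visits > 0 and node.state != TARGET:
            node.children = [Node(s, node) for s in tactics(node.state)]
            node = rng.choice(node.children)
        # Rollout: greedy walk toward the self-generated subgoal.
        state, reward = node.state, 0.0
        for _ in range(10):
            if state == TARGET:
                reward = 1.0
                break
            subgoal = propose_subgoal(state)
            state = min(tactics(state), key=lambda s: abs(s - subgoal))
        # Backpropagation: credit the whole path for reaching the target.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first move from the root.
    return max(root.children, key=lambda n: n.visits).state
```

In a real instantiation, the rollout step would submit a tactic to the proof assistant and the subgoal generator would condition on the resulting proof state; here the subgoal merely biases the rollout, which is what makes the sparse terminal reward reachable within a short horizon.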