Hilbert: Recursively Building Formal Proofs with Informal Reasoning

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) frequently produce errors in mathematical reasoning that cannot be automatically verified. Formal proof systems such as Lean 4 verify proofs rigorously, but the specialized prover LLMs built for them solve far fewer problems than general-purpose LLMs reasoning in natural language. Method: We propose Hilbert, an agentic framework that combines informal reasoning with formal verification. It recursively decomposes problems into verifiable subgoals and orchestrates an informal reasoning LLM, a prover LLM specialized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever, refining incorrect proofs with verifier feedback. Contribution/Results: Hilbert narrows the gap between informal reasoning and formal proof generation. On miniF2F, it attains 99.2%, which is 6.6 percentage points above the best publicly available method. On PutnamBench, it solves 462 of 660 problems (70.0%), outperforming the proprietary SeedProver (50.4%) and improving on the best publicly available baseline by 422%.

📝 Abstract

Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically verified. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals, each of which it solves with the prover or the reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6 percentage points above the best publicly available method. Hilbert also achieves the best known result on PutnamBench: it solves 462/660 problems (70.0%), outperforming proprietary approaches like SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation.
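
The control flow described in the abstract (try the prover, refine with verifier feedback, and fall back to recursive decomposition) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Components` container, the function names, the depth and refinement limits, and the toy string-based "proofs" are all assumptions made for the example, and the semantic theorem retriever is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Components:
    """Hypothetical stand-ins for the prover, verifier, and reasoner."""
    prove: Callable[[str], Optional[str]]             # goal -> candidate proof, or None
    verify: Callable[[str, str], Optional[str]]       # (goal, proof) -> None if valid, else error text
    refine: Callable[[str, str, str], Optional[str]]  # (goal, proof, error) -> revised proof, or None
    decompose: Callable[[str], List[str]]             # goal -> list of subgoals (empty if none)


def attempt_direct(goal: str, c: Components, max_refinements: int) -> Optional[str]:
    """Try to prove the goal directly, refining with verifier feedback."""
    proof = c.prove(goal)
    for attempt in range(max_refinements + 1):
        if proof is None:
            return None
        error = c.verify(goal, proof)
        if error is None:
            return proof
        if attempt < max_refinements:
            proof = c.refine(goal, proof, error)
    return None


def solve(goal: str, c: Components, depth: int = 0,
          max_depth: int = 3, max_refinements: int = 2) -> Optional[str]:
    """Direct attempt first; otherwise split into subgoals and recurse."""
    proof = attempt_direct(goal, c, max_refinements)
    if proof is not None:
        return proof
    if depth >= max_depth:
        return None
    subgoals = c.decompose(goal)
    if not subgoals:
        return None
    sub_proofs = []
    for sub in subgoals:
        sub_proof = solve(sub, c, depth + 1, max_depth, max_refinements)
        if sub_proof is None:
            return None  # one unsolved subgoal fails this decomposition
        sub_proofs.append(sub_proof)
    # A real system would re-verify the assembled proof; this toy just joins strings.
    return "combine(" + ", ".join(sub_proofs) + ")"


# Toy components: the "prover" only handles goals without " and ".
toy = Components(
    prove=lambda g: f"proof({g})" if " and " not in g else None,
    verify=lambda g, p: None if p == f"proof({g})" else "mismatch",
    refine=lambda g, p, e: None,
    decompose=lambda g: g.split(" and ") if " and " in g else [],
)

result = solve("p and q", toy)
print(result)
```

In this toy run the direct attempt on `p and q` fails, so the goal is split into `p` and `q`, each of which the stub prover handles on its own. Capping recursion depth and refinement rounds is one simple way to keep such a loop from running indefinitely.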

Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between informal reasoning and formal proof generation

Automating verifiable mathematical proofs using LLMs and Lean 4

Solving more problems than existing methods through recursive decomposition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines informal reasoning with formal verification

Uses recursive decomposition to split problems

Leverages verifier feedback to refine proofs
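
To make the idea of splitting a problem into verifiable subgoals concrete, here is a small Lean 4 sketch. The theorem is an illustrative example chosen for this summary, not one taken from the paper, and it assumes Mathlib is available. The first version states each subgoal as a `have` and leaves it open with `sorry`; the second shows the same skeleton once each subgoal has been proved.

```lean
import Mathlib

-- Step 1: a proof sketch in which each subgoal is stated but left open.
theorem sum_sq_nonneg_sketch (a b : ℝ) : 0 ≤ a * a + b * b := by
  have h1 : 0 ≤ a * a := by sorry
  have h2 : 0 ≤ b * b := by sorry
  exact add_nonneg h1 h2

-- Step 2: the same skeleton after each subgoal has been discharged.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a * a + b * b := by
  have h1 : 0 ≤ a * a := mul_self_nonneg a
  have h2 : 0 ≤ b * b := mul_self_nonneg b
  exact add_nonneg h1 h2
```

Because the skeleton type-checks even with the `sorry` placeholders, a verifier can confirm that the subgoals suffice for the main statement before any of them is proved.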