AI Summary
Large language models (LLMs) exhibit limited theorem-proving capabilities under two key constraints: scarcity of explicit supervision signals when relying solely on natural-language proofs, and weak proficiency in formal proof languages (e.g., Lean).
Method: This paper introduces a lemma-style whole-proof reasoning framework featuring a closed-loop refinement mechanism that integrates Lean-based formal verification feedback, reuse of already-proven lemmas, and self-summarization. It further designs test-time inference strategies that jointly optimize depth (long chain-of-thought reasoning plus reinforcement learning) and breadth (multi-path exploration), and develops Seed-Geometry, a domain-specific formal reasoning engine for geometry.
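The closed-loop refinement cycle described above can be sketched as follows. This is a minimal illustration, not the paper's actual interface: `propose` stands in for the proof-generating model and `verify` for the Lean checker, and both names are hypothetical.

```python
# Minimal sketch (all names hypothetical) of closed-loop proof refinement:
# propose a proof, check it with a formal verifier, and feed verifier errors
# and any lemmas that already check back into the next attempt.

def refine_proof(problem, propose, verify, max_rounds=3):
    """Iteratively refine a candidate proof using verifier feedback."""
    feedback, lemma_bank = None, []
    for _ in range(max_rounds):
        candidate = propose(problem, feedback, lemma_bank)
        ok, errors, proved_lemmas = verify(candidate)
        lemma_bank.extend(proved_lemmas)  # reuse lemmas that already verify
        if ok:
            return candidate              # verified whole proof
        feedback = errors                 # next attempt sees the errors
    return None                           # gave up after max_rounds
```

The key design point mirrored here is that each round conditions on two signals: the verifier's error messages and the growing bank of verified lemmas.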
Results: The approach proves 78.1% of formalized past IMO problems, saturates MiniF2F, achieves over 50% on PutnamBench, and fully proved 5 of the 6 problems at IMO 2025.
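As a toy illustration of the breadth component of the test-time strategy (multi-path exploration), one can sample several independent proof candidates and keep any that the verifier accepts. `sample_proof` and `verify` below are hypothetical placeholders, not the paper's API.

```python
# Sketch of breadth-style search: sample independent candidates
# (e.g., with different random seeds) until one formally verifies.

def prove_with_breadth(problem, sample_proof, verify, n_samples=8):
    """Return the first sampled candidate that passes verification."""
    for seed in range(n_samples):
        candidate = sample_proof(problem, seed)  # different seed -> different path
        if verify(candidate):
            return candidate
    return None  # no sampled path verified
```

In practice such breadth is combined with the depth dimension (long-chain refinement of each candidate), rather than used alone.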
Abstract
LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose **Seed-Prover**, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine **Seed-Geometry**, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
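To make "lemma-style whole-proof" reasoning concrete, here is a minimal Lean 4 sketch (an illustrative example of mine, not taken from the paper): a small lemma is stated and proved once, then reused to close the main goal.

```lean
-- Illustrative only: a tiny lemma proved once, then reused in the main goal.
theorem add_zero_right (n : Nat) : n + 0 = n := Nat.add_zero n

theorem main_goal (a b : Nat) : (a + 0) + (b + 0) = a + b := by
  rw [add_zero_right, add_zero_right]
```

The Lean kernel checks every step, which is exactly the kind of unambiguous pass/fail supervision signal the abstract contrasts with natural-language proofs.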