🤖 AI Summary
Traditional automated theorem proving (ATP) relies on formal logical systems, creating a semantic gap with the informal natural-language knowledge that large language models (LLMs) acquire during pretraining—thereby limiting their mathematical reasoning capabilities. Method: We introduce DeepTheorem, the first large-scale, IMO-level informal theorem-proving benchmark, comprising 121K theorem-proof pairs; propose RL-Zero, a novel reinforcement learning framework that enables end-to-end training by verifying systematically constructed theorem variants; and design an evaluation scheme that assesses both conclusion correctness and the quality of intermediate reasoning steps. Contribution/Results: We construct a rigorously annotated, cross-domain, verifiable, high-quality dataset. Our approach significantly improves theorem-proving accuracy and reasoning-step quality across multiple state-of-the-art LLMs, achieving new SOTA performance. This advances LLMs' capacity for mathematical exploration and practical automated reasoning.
📝 Abstract
Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that align poorly with LLMs' strengths, which derive from the informal, natural-language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework that exploits natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, and accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate that DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.
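The key idea enabling reinforcement learning on *informal* proofs is that each theorem comes with verifiable variants (e.g., a provable statement and a disprovable counterpart), so a proof attempt can be scored by whether its final verdict matches the known ground truth, without a formal proof checker. The sketch below illustrates this variant-based binary outcome reward; all names (`TheoremVariant`, `variant_reward`, the verdict-parsing heuristic) are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of a variant-based outcome reward in the spirit of
# RL-Zero: the model must decide whether a theorem variant is provable or
# disprovable, and the reward checks its verdict against the ground truth.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TheoremVariant:
    statement: str
    is_true: bool  # ground-truth label: provable (True) or disprovable (False)


def parse_verdict(model_output: str) -> Optional[bool]:
    """Extract the model's final verdict from its informal proof attempt.

    Checks for a "disproved"/"false" conclusion first, since the string
    "proved" is a substring of "disproved".
    """
    text = model_output.lower()
    if "disproved" in text or "the statement is false" in text:
        return False
    if "proved" in text or "the statement is true" in text:
        return True
    return None  # no clear verdict -> no positive reward signal


def variant_reward(variant: TheoremVariant, model_output: str) -> float:
    """Binary outcome reward: 1.0 iff the verdict matches the label."""
    verdict = parse_verdict(model_output)
    if verdict is None:
        return 0.0
    return 1.0 if verdict == variant.is_true else 0.0
```

Because the label is attached to the variant rather than derived from checking every proof step, this gives a cheap, automatically verifiable training signal; step-level quality would be assessed separately by the process evaluation metrics described above.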