Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit poor performance in multi-step first-order logic (FOL) theorem proving—e.g., Deepseek-Prover-V2-7B achieves only 4.2% accuracy on our newly constructed Lean 4 benchmark of 447 problems—primarily due to monolithic proof strategies and irreversible error propagation in early reasoning steps. Method: We propose DREAM, a novel framework integrating axiom-driven adaptive strategy diversification with sub-propositional error identification and reflective regeneration, enabling fine-grained error correction and robust, cooperative reasoning. Built atop the Lean 4 formal system, DREAM unifies prompt engineering, dynamic strategy control, and hierarchical feedback. Contribution/Results: On our benchmark, DREAM improves accuracy by 0.6–6.4 percentage points over strong baselines. It establishes the first dedicated, scalable evaluation benchmark for multi-step FOL theorem proving and introduces a generalizable methodological paradigm for formal reasoning with LLMs.

📝 Abstract
Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
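To give a sense of what "mathematical theorems in Lean 4 format" means here, the following is a minimal illustrative sketch (a hypothetical example, not taken from the paper's dataset) of a multi-step first-order logic statement and proof in Lean 4:

```lean
-- Illustrative FOL theorem in Lean 4 (hypothetical example, not from the dataset):
-- if p implies q everywhere, then a witness for p yields a witness for q.
theorem exists_of_forall_imp {α : Type} (p q : α → Prop)
    (h : ∀ x, p x → q x) (hx : ∃ x, p x) : ∃ x, q x := by
  cases hx with
  | intro a ha => exact ⟨a, h a ha⟩
```

Each proof step is checked by the Lean 4 kernel, which is what allows a framework like DREAM to detect an incorrect sub-proposition and regenerate that part of the proof rather than discarding the whole attempt.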
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs' multi-step first-order logic reasoning
Addressing low accuracy in complex theorem proving tasks
Improving diversity and correctness of proof strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Axiom-Driven Strategy Diversification for diverse proofs
Sub-Proposition Error Feedback for proof correction
DREAM enhances LLMs' reasoning diversity and accuracy
Authors

Chuxue Cao, Hong Kong University of Science and Technology
Mengze Li, Hong Kong University of Science and Technology
Juntao Dai, Peking University
Jinluan Yang, Zhejiang University
Zijian Zhao, Hong Kong University of Science and Technology
Shengyu Zhang, Zhejiang University
Weijie Shi, Hong Kong University of Science and Technology
Chengzhong Liu, Hong Kong University of Science and Technology
Sirui Han, Hong Kong University of Science and Technology
Yike Guo, Hong Kong University of Science and Technology