CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the misalignment between large language models (LLMs) and human reasoning preferences in automated theorem proving (ATP), exacerbated by the scarcity of high-quality preference data. We introduce Direct Preference Optimization (DPO) to ATP for the first time and propose CuDIP, a curriculum-learning-driven iterative DPO framework. To mitigate reliance on manual annotation, we design an automated method for generating high-quality preference pairs using LLMs and existing formal proof corpora. Evaluated on MiniF2F and ProofNet, our approach achieves substantial improvements in proof success rates, demonstrating that preference alignment meaningfully enhances mathematical reasoning. Our key contributions are: (1) the first application of DPO to ATP; (2) a fully automated, annotation-free paradigm for preference data generation; and (3) a curriculum-based iterative optimization framework that progressively aligns LLMs’ formal reasoning capabilities with expert human preferences.
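The DPO objective referenced in the summary is the standard one: it rewards the policy for increasing the log-probability margin of a preferred proof over a dispreferred one, relative to a frozen reference model. A minimal per-pair sketch (function and argument names are illustrative, not from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_*     : summed token log-probs under the current policy
    ref_logp_* : the same quantities under the frozen reference model
    beta       : temperature controlling deviation from the reference

    Returns -log(sigmoid(beta * margin)); a lower value means the policy
    already prefers the chosen (human-preferred) proof more strongly
    than the reference model does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2; it decreases as the policy widens the chosen-over-rejected gap beyond the reference model's.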

📝 Abstract
Automated theorem proving (ATP) is one of the most challenging mathematical reasoning tasks for Large Language Models (LLMs). Most existing LLM-based ATP methods rely on supervised fine-tuning, which results in a limited alignment between the theorem proving process and human preferences. Direct Preference Optimization (DPO), which aligns LLMs with human preferences, has shown positive effects for certain tasks. However, the lack of high-quality preference data for theorem proving presents a significant challenge. In this paper, we innovatively apply DPO to formal automated theorem proving and introduce a Curriculum Learning-based DPO Iterative Theorem Proving (CuDIP) method. Specifically, we propose a method for constructing preference data that utilizes LLMs and existing theorem proving data to enhance the diversity of the preference data while reducing the reliance on human preference annotations. We then integrate this preference data construction method with curriculum learning to iteratively fine-tune the theorem proving model through DPO. Experimental results on the MiniF2F and ProofNet datasets demonstrate the effectiveness of the proposed method.
Problem

Research questions and friction points this paper is trying to address.

Enhancing theorem proving in LLMs
Aligning LLMs with human preferences
Reducing reliance on human annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum Learning-based DPO
Constructing preference data via LLMs
Iterative fine-tuning for theorem proving
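The curriculum-learning component described above can be pictured as scheduling preference pairs from easy to hard across successive DPO rounds. A hypothetical sketch of such a scheduler (the `difficulty` field and cumulative-slice policy are assumptions for illustration, not details from the paper):

```python
def curriculum_rounds(pairs, num_rounds=3):
    """Hypothetical curriculum scheduler for iterative DPO.

    pairs      : list of dicts, each with a numeric "difficulty" score
                 (e.g. proof length or failure rate of the base prover)
    num_rounds : number of DPO fine-tuning iterations

    Returns one training slice per round: pairs sorted by difficulty,
    with each round cumulatively adding a harder slice, so early rounds
    see only easy pairs and the final round sees the full set.
    """
    ordered = sorted(pairs, key=lambda p: p["difficulty"])
    chunk = len(ordered) // num_rounds or 1
    return [ordered[: chunk * (r + 1)] for r in range(num_rounds)]
```

Each returned slice would feed one DPO iteration, with preference pairs regenerated between rounds from the updated prover's own attempts.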