Theorem Prover as a Judge for Synthetic Data Generation

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the dual challenges of low-quality synthetic data and the difficulty of validating intermediate reasoning steps in mathematical reasoning. To this end, it proposes a theorem-prover-as-judge generation paradigm with three core components: (1) an iterative autoformalisation mechanism that progressively raises the execution rate of LLM-generated reasoning chains in the Lean theorem prover; (2) the TP-as-a-Judge framework, which formally verifies individual reasoning steps at fine granularity; and (3) RLTPF, a reinforcement learning algorithm driven by theorem prover feedback, eliminating reliance on human annotation. Empirically, the approach raises the Lean execution rate from 60% to 87%. With only 3,508 training samples, it achieves accuracy gains of +5.56% (Mistral-7B) on MultiArith, +6.00% (Llama-2-7B) on SVAMP, and +3.55% (Llama-3.1-8B) on AQUA, enhancing the robustness and generalization of mathematical language models.
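The iterative autoformalisation mechanism can be sketched as a retry loop: an LLM drafts a Lean formalisation, the prover attempts to execute it, and any error message is fed back into the next draft until the code executes or a round budget is exhausted. The sketch below is a minimal illustration of that control flow, not the paper's implementation; `formalize` and `check` are hypothetical placeholders standing in for the LLM call and the Lean runner.

```python
def iterative_autoformalize(statement, formalize, check, max_rounds=5):
    """Retry loop for autoformalisation.

    formalize(statement, error) -> candidate Lean source; a placeholder
        for an LLM call, optionally conditioned on the previous error.
    check(candidate) -> (ok, error_message); a placeholder for running
        the candidate through the Lean prover.
    """
    error = None
    for _ in range(max_rounds):
        candidate = formalize(statement, error)
        ok, error = check(candidate)
        if ok:
            return candidate
    return None  # discard samples that never formalise successfully

# Toy stand-ins to exercise the loop: the first draft fails with a
# prover error, and the retry (now seeing that error) succeeds.
def toy_formalize(statement, error):
    return "fixed" if error else "broken"

def toy_check(candidate):
    return (True, None) if candidate == "fixed" else (False, "type mismatch")

result = iterative_autoformalize("1 + 1 = 2", toy_formalize, toy_check)
```

The key design point is that the prover's error message, not a human judgment, drives each refinement round, which is what lets the execution rate climb across iterations.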

📝 Abstract
The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
Problem

Research questions and friction points this paper is trying to address.

Ensuring the validity of intermediate steps in synthetic data
Improving the accuracy of autoformalisation for theorem provers
Enhancing LLM reasoning with theorem prover feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative autoformalisation refines Lean formalisations across rounds
TP-as-a-Judge rigorously assesses intermediate LLM reasoning
RLTPF replaces human annotation with theorem prover feedback
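To make the "theorem prover as judge" idea concrete, the kind of check involved can be illustrated with a tiny Lean 4 example: an intermediate arithmetic step from a word problem, stated as a theorem that Lean's kernel either accepts or rejects. This is an illustrative example, not taken from the paper.

```lean
-- A candidate intermediate step, e.g. "doubling the sum of 3 and 4 gives 14".
-- If the step were wrong (say, = 15), Lean would reject the proof,
-- giving a machine-checkable verdict with no human annotator involved.
example : (3 + 4) * 2 = 14 := rfl
```

A step that fails this check can then be used as a negative signal in RLTPF, while steps that pass serve as verified synthetic training data.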