Conjecturing: An Overlooked Step in Formal Mathematical Reasoning

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior LLM-based automated formalization research overlooks “conjecture generation” — a critical prerequisite step — leading to inflated performance estimates and the absence of dedicated evaluation frameworks. Method: We formally model conjecture generation as an independent task, introduce ConjectureBench — the first benchmark dataset specifically designed for mathematical conjectures — and propose a disentangled evaluation paradigm. Our approach incorporates enhanced data construction strategies, fine-grained conjecture quality metrics, and a reasoning-time optimization technique, Lean-FIRe. Results: Experiments demonstrate that GPT-4.1 and DeepSeek-V3.1 achieve end-to-end automated formalization on 13 and 7 problems, respectively, in PutnamBench — surpassing prior zero-solution results — thereby empirically validating the decisive role of conjecture generation. This work establishes a novel two-stage “conjecture → formalization” paradigm, providing both theoretical foundations and practical methodologies for trustworthy mathematical AI.

📝 Abstract
Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion, such as an explicit answer or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled with autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs, both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe, to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task and further investigating how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.
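To make the "conjecture before formalise" point concrete, here is a minimal illustrative sketch (not taken from the paper, and simpler than a Putnam problem): the informal problem "evaluate the sum of the first n odd numbers" has no complete formal statement until the closed-form answer n² is conjectured; only then can the claim be written and proved in Lean 4 with Mathlib.

```lean
import Mathlib

-- Hypothetical example: the informal task is "evaluate ∑ (2k + 1) for k < n".
-- The value n ^ 2 below is the *conjectured* answer; without it there is
-- nothing to state, which is why conjecturing precedes autoformalisation.
theorem sum_first_n_odds (n : ℕ) :
    ∑ k ∈ Finset.range n, (2 * k + 1) = n ^ 2 := by
  induction n with
  | zero => rfl
  | succ m ih =>
    -- Peel off the last summand, rewrite with the induction hypothesis,
    -- and close the resulting polynomial identity.
    rw [Finset.sum_range_succ, ih]
    ring
```

PutnamBench follows the same pattern at scale: each entry pairs a placeholder for the conjectured answer with a theorem statement that references it, so an end-to-end system must produce both.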
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' conjecturing ability in mathematical reasoning separately
Improving autoformalisation by treating conjecturing as independent task
Addressing performance overestimation when conjecture is not provided
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created ConjectureBench dataset for evaluating LLM conjecturing ability
Designed Lean-FIRe method to improve conjecturing and autoformalisation
Treats conjecturing as independent task within mathematical reasoning pipeline