🤖 AI Summary
This work addresses a limitation of existing automated theorem proving (ATP) benchmarks: they embed the final answer within the formal statement (“Easy Mode”) and therefore do not evaluate a model’s ability to discover that answer independently. To remedy this, we introduce the “Hard Mode” setting, in which a system must first find the answer on its own and then construct a formal proof. We release MiniF2F-Hard and FIMO-Hard, Hard Mode variants of two widely used ATP benchmarks. We also present DAP (Discover And Prove), an open-source agent framework that uses large language models (LLMs) for natural-language reasoning with self-reflection to discover answers, then rewrites each Hard Mode statement into a Lean 4–verifiable Easy Mode statement for existing ATP provers. DAP solves 10 problems on CombiBench (up from the previous best of 7) and 36 on PutnamBench in Hard Mode. The experiments also show that LLMs exceed 80% answer accuracy on the same problems where formal provers succeed on under 10%.
📝 Abstract
Most automated theorem proving (ATP) benchmarks embed the final answer within the formal statement, a convention we call "Easy Mode". This design simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises the number of solved problems from 7 (previous SOTA, Pass@16) to 10, and on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode. At the same time, our results reveal that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.
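To make the Easy Mode / Hard Mode distinction concrete, the following Lean 4 sketch states one toy problem in both styles. It is illustrative only: the problem, the names `toy_easy`, `toy_hard`, and `toy_answer`, and the use of a `sorry` placeholder for the unknown answer are assumptions chosen for this example, not statements taken from MiniF2F-Hard, FIMO-Hard, or the DAP implementation.

```lean
import Mathlib

-- Easy Mode: the answer (x = 4) is written into the statement,
-- so the prover only has to verify it.
theorem toy_easy (x : ℤ) (h : 2 * x + 3 = 11) : x = 4 := by
  omega

-- Hard Mode: the answer is a placeholder that must be discovered first.
-- `toy_answer` is a hypothetical name; `sorry` marks the hole to fill.
abbrev toy_answer : ℤ := sorry

theorem toy_hard (x : ℤ) (h : 2 * x + 3 = 11) : x = toy_answer := by
  sorry
```

Under this reading, a discover-then-prove pipeline would first determine that the answer is 4, substitute it for the placeholder, and hand the resulting Easy Mode statement (here identical to `toy_easy`) to an existing prover.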