🤖 AI Summary
This work addresses the challenge of fully automated theorem formalization, which requires simultaneous optimization of formal validity, logical fidelity, mathematical consistency, and syntactic quality—dimensions often addressed in isolation by existing approaches that also typically rely on reference standards. To overcome these limitations, we propose a reference-free, monotonically improving iterative optimization framework that leverages complementary feedback from a theorem prover and a panel of multi-role large language models (LLMs). Our approach introduces a novel response mapping mechanism to guide each LLM role toward targeted refinements and incorporates an acceptance strategy with convergence criteria that guarantee monotonic performance improvement. Experimental results demonstrate that our method achieves 93.44% formal validity and 78.22% overall score on miniF2F, and 44.09% formal validity with 29.79% overall score on ProofNet, establishing new state-of-the-art performance without reference-guided supervision.
📝 Abstract
While statement autoformalization has advanced rapidly, full-theorem autoformalization remains largely unexplored. Existing iterative refinement methods in statement autoformalization typically improve isolated aspects of formalization, such as syntactic correctness, but struggle to jointly optimize multiple quality dimensions, which is critical for full-theorem autoformalization. We introduce a reference-free iterative monotonic process for full-theorem autoformalization that leverages complementary feedback from theorem provers and LLM-based judges, without access to ground-truth proofs or existing formalizations at inference time. Our approach optimizes a masked composite objective over Formal Validity, Logical Preservation, Mathematical Consistency, and Formal Quality, guided by a responsiveness map that indicates how different LLMs acting in different roles preferentially improve each dimension. We further propose an acceptance policy that guarantees certified monotonic improvement, and provide conditions ensuring convergence and termination. Empirical experiments demonstrate that the proposed process enables simultaneous improvement across multiple dimensions, achieving 93.44% formal validity and a 78.22% overall score on miniF2F, and 44.09% formal validity and a 29.79% overall score on ProofNet.
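The core of the acceptance policy can be illustrated with a minimal sketch. All names below (`composite_score`, `refine`, the dimension keys, and the candidate dicts) are hypothetical stand-ins for the paper's prover/LLM-judge feedback loop; the sketch only shows why accepting a revision strictly when its masked composite score improves yields a monotonically non-decreasing score trajectory.

```python
# Hypothetical sketch of an accept-if-improved policy over a masked
# composite objective; not the paper's actual implementation.

DIMENSIONS = ["formal_validity", "logical_preservation",
              "mathematical_consistency", "formal_quality"]

def composite_score(scores, mask, weights):
    """Masked weighted sum over the four quality dimensions."""
    return sum(weights[d] * scores[d] for d in DIMENSIONS if mask[d])

def refine(initial_scores, candidates, mask, weights, eps=1e-9):
    """Accept a candidate revision only on strict improvement.

    `candidates` stands in for successive prover/LLM-judge revisions;
    each entry is a dict of per-dimension scores for one proposal.
    Returns the best scores seen and the (non-decreasing) score trajectory.
    """
    best = initial_scores
    best_score = composite_score(best, mask, weights)
    trajectory = [best_score]
    for cand in candidates:
        s = composite_score(cand, mask, weights)
        if s > best_score + eps:      # acceptance policy: strict improvement
            best, best_score = cand, s
        trajectory.append(best_score)  # monotone by construction
    return best, trajectory
```

Because the incumbent is replaced only when the masked composite score strictly increases, the recorded trajectory never decreases, which is the sense in which the improvement is certified.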