🤖 AI Summary
This study addresses the fundamental gap between informal set-theoretic intuition in natural language mathematics and the rigorous type-theoretic foundations of Lean 4, a discrepancy that often leads large language models to generate non-compilable or semantically distorted formalizations. To systematically evaluate strategies for bridging this gap, the work introduces a factorial experimental design that isolates and assesses the individual and synergistic effects of three tool-augmentation mechanisms: fine-tuned model querying, symbolic knowledge retrieval, and feedback from the Lean REPL compiler. The results demonstrate that the tool-augmented agent substantially improves both the compilation success rate and the semantic equivalence of generated code relative to one-shot baselines. The factorial analysis also quantifies the marginal contribution of each tool to overall performance and uncovers the underlying mechanisms that enable effective tool synergy.
📝 Abstract
Automatic translation of natural language mathematics into faithful Lean 4 code is hindered by the fundamental dissonance between informal set-theoretic intuition and strict formal type theory. This gap often causes LLMs to hallucinate non-existent library definitions, resulting in code that fails to compile or lacks semantic fidelity. In this work, we investigate the effectiveness of tool-augmented agents for this task through a systematic factorial analysis of three distinct tool categories: Fine-tuned Model Querying (accessing expert drafts), Knowledge Search (retrieving symbol definitions), and Compiler Feedback (verifying code via a Lean REPL). We first benchmark the agent against one-shot baselines, demonstrating large gains in both compilation success and semantic equivalence. We then use the factorial decomposition to quantify the impact of each category, isolating the marginal contribution of each tool type to overall performance.
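The factorial decomposition described above can be illustrated with a minimal sketch. It assumes a full 2^3 design in which each of the three tool categories is either enabled or disabled, and it uses the textbook definitions of main and two-way interaction effects. The abstract does not specify the exact estimator, so this is an illustration and not the authors' analysis code. The identifiers (`TOOLS`, `configurations`, `main_effect`, `interaction_effect`, `toy_score`) are hypothetical, and the scores are synthetic values chosen only to make the arithmetic easy to check.

```python
from itertools import product

# Hypothetical labels for the three tool categories named in the abstract.
TOOLS = ("model_query", "knowledge_search", "compiler_feedback")


def configurations(tools=TOOLS):
    """Enumerate every on/off combination of the tools (2**k tuples of bools)."""
    return list(product((False, True), repeat=len(tools)))


def _mean(values):
    return sum(values) / len(values)


def main_effect(scores, tool, tools=TOOLS):
    """Mean score with `tool` enabled minus mean score with it disabled."""
    i = tools.index(tool)
    on = [s for cfg, s in scores.items() if cfg[i]]
    off = [s for cfg, s in scores.items() if not cfg[i]]
    return _mean(on) - _mean(off)


def interaction_effect(scores, tool_a, tool_b, tools=TOOLS):
    """Half the change in tool_a's effect when tool_b goes from off to on."""
    ia, ib = tools.index(tool_a), tools.index(tool_b)

    def cell(a, b):
        return _mean([s for cfg, s in scores.items() if cfg[ia] == a and cfg[ib] == b])

    effect_a_with_b = cell(True, True) - cell(False, True)
    effect_a_without_b = cell(True, False) - cell(False, False)
    return (effect_a_with_b - effect_a_without_b) / 2


def toy_score(cfg):
    """Synthetic score (NOT data from the paper): additive terms plus one synergy term."""
    a, b, c = cfg
    return 10 * a + 20 * b + 5 * c + 8 * (a and b)


# One synthetic score per configuration; a real study would plug in measured
# compilation or semantic-equivalence rates here.
scores = {cfg: toy_score(cfg) for cfg in configurations()}

print(main_effect(scores, "model_query"))                             # 14.0
print(interaction_effect(scores, "model_query", "knowledge_search"))  # 4.0
```

In this toy setup the main effect of `model_query` (14.0) exceeds its additive coefficient (10) because half of the configurations also enable `knowledge_search`, where the synergy term applies. Separating that synergy from each tool's marginal contribution is the kind of question a factorial design is built to answer.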