🤖 AI Summary
Motivation: Existing theorem-proving benchmarks focus on static, isolated tasks and fail to capture the iterative, engineering-intensive workflows characteristic of real-world formal mathematics libraries (e.g., Mathlib4).
Method: We propose Automated Proof Engineering (APE) as a new paradigm and introduce the first file-level benchmark grounded in authentic commit histories, covering feature addition, proof refactoring, and bug fixing. We design a novel evaluation framework that pairs natural-language problem descriptions with hybrid verification, integrating the Lean compiler and an LLM-as-a-Judge mechanism, and we develop Eleanstic, a scalable parallel verification system supporting multiple Mathlib versions.
Contribution/Results: Experiments reveal that mainstream LLMs perform well on localized edits but suffer substantial performance degradation on complex proof engineering tasks. This work establishes foundational data, methodology, and evaluation infrastructure for APE research.
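The hybrid verification described above can be sketched as a two-stage check: a formal pass (does the patched file compile?) followed by a semantic pass (does the edit actually satisfy the natural-language task description, as judged by an LLM?). The sketch below is illustrative only; all names (`hybrid_verify`, `Verdict`, the injected callables) are hypothetical and not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    compiled: bool           # formal check: Lean compiler accepted the patch
    judged_faithful: bool    # semantic check: LLM judge deems the task satisfied

    @property
    def passed(self) -> bool:
        # A patch counts as solved only if it both compiles and
        # faithfully implements the natural-language task description.
        return self.compiled and self.judged_faithful

def hybrid_verify(patch: str,
                  compile_check: Callable[[str], bool],
                  llm_judge: Callable[[str], bool]) -> Verdict:
    """Two-stage hybrid verification: compile first, then judge.

    `compile_check` and `llm_judge` are injected so the sketch stays
    self-contained; in practice they would wrap a Lean build and an
    LLM API call, respectively.
    """
    compiled = compile_check(patch)
    # Skip the (expensive) LLM judge when the compiler already rejects.
    judged = llm_judge(patch) if compiled else False
    return Verdict(compiled=compiled, judged_faithful=judged)
```

The ordering matters: the compiler is cheap and deterministic relative to the judge, so it acts as a filter, and the LLM judge is only consulted for patches that are already formally valid.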
📝 Abstract
Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.