🤖 AI Summary
Motivation: Existing theorem-proving benchmarks focus on static, isolated tasks and fail to capture the iterative, engineering-intensive workflows characteristic of real-world formal mathematics libraries (e.g., Mathlib4).
Method: We propose Automated Proof Engineering (APE) as a new paradigm and introduce the first file-level benchmark grounded in authentic commit histories, covering feature addition, proof refactoring, and bug fixing. We design a novel evaluation framework that pairs natural-language problem descriptions with hybrid verification, integrating the Lean compiler and an LLM-as-a-Judge mechanism, and we develop Eleanstic, a scalable parallel verification system supporting multiple Mathlib versions.
Contribution/Results: Experiments reveal that mainstream LLMs perform well on localized edits but suffer substantial performance degradation on complex proof engineering tasks. This work establishes foundational data, methodology, and evaluation infrastructure for APE research.
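The hybrid verification described above can be sketched as a two-stage check: a formal pass (does the patched file compile?) followed by a semantic pass (does the edit actually satisfy the natural-language task description, as judged by an LLM?). The sketch below is illustrative only; all names (`hybrid_verify`, `Verdict`, the injected callables) are hypothetical and not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    compiled: bool           # formal check: Lean compiler accepted the patch
    judged_faithful: bool    # semantic check: LLM judge deems the task satisfied

    @property
    def passed(self) -> bool:
        # A patch counts as solved only if it both compiles and
        # faithfully implements the natural-language task description.
        return self.compiled and self.judged_faithful

def hybrid_verify(patch: str,
                  compile_check: Callable[[str], bool],
                  llm_judge: Callable[[str], bool]) -> Verdict:
    """Two-stage hybrid verification: compile first, then judge.

    `compile_check` and `llm_judge` are injected so the sketch stays
    self-contained; in practice they would wrap a Lean build and an
    LLM API call, respectively.
    """
    compiled = compile_check(patch)
    # Skip the (expensive) LLM judge when the compiler already rejects.
    judged = llm_judge(patch) if compiled else False
    return Verdict(compiled=compiled, judged_faithful=judged)
```

The ordering matters: the compiler is cheap and deterministic relative to the judge, so it acts as a filter, and the LLM judge is only consulted for patches that are already formally valid.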
📝 Abstract
Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.