MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

📅 2026-05-07

🤖 AI Summary
This work asks whether pull requests (PRs) to Mathlib can be efficiently judged against the merge criteria of its manual review process. It introduces MathlibPR, a benchmark derived from real Mathlib4 PR histories, which are repurposed as a supervised signal. The study proposes a staged evaluation protocol to assess how well large language models (including DeepSeek, Qwen, Goedel, and Kimina) and coding agents (such as Codex and Claude Code) judge PR merge-readiness. Experiments show that current models struggle to distinguish merge-ready PRs from those that pass the build but were later revised or never merged, which highlights the task's difficulty and lays groundwork for code review assistants and reward models.
📝 Abstract
The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning, with remarkable successes in recent years. Those successes, however, consume Mathlib as an essential dependency without contributing to it directly. Meanwhile, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow Mathlib's conventions and are worth integrating into a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLMs (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code). Surprisingly, both LLMs and LLM agents struggle to distinguish merge-ready PRs from build-passing PRs that were revised or never merged. By turning Mathlib PR histories into a supervised signal, MathlibPR provides a step toward reviewer assistants and reward models that could help evaluate PRs and steer LLMs toward producing merge-ready Mathlib contributions.
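As an illustration of how PR histories could serve as a supervised signal, the sketch below labels PR snapshots and scores a judge on the build-passing ones only. The abstract does not specify the benchmark's schema or the stages of its protocol, so every name here (`PRSnapshot`, `label`, `staged_accuracy`) and the two-stage split are assumptions made for illustration, not the paper's method.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PRSnapshot:
    """One revision of a PR. Hypothetical fields, not the benchmark's schema."""
    pr_id: str
    builds: bool        # did this revision pass the build?
    merged_as_is: bool  # was this exact revision the one that got merged?


def label(snap: PRSnapshot) -> str:
    """Derive a label from the PR history (assumed labeling rule)."""
    if not snap.builds:
        return "build_failing"
    return "merge_ready" if snap.merged_as_is else "not_ready"


def staged_accuracy(
    snaps: Iterable[PRSnapshot],
    judge: Callable[[PRSnapshot], bool],
) -> float:
    """Stage 1 keeps build-passing snapshots; stage 2 scores the judge on them.

    `judge` returns True when it predicts the snapshot is merge-ready.
    Returns 0.0 when no snapshot passes the build.
    """
    passing = [s for s in snaps if s.builds]
    if not passing:
        return 0.0
    correct = sum(judge(s) == (label(s) == "merge_ready") for s in passing)
    return correct / len(passing)
```

Scoring only build-passing snapshots mirrors the distinction the abstract reports as hard: separating merge-ready PRs from ones that build but were revised or never merged. A judge that accepts every snapshot is a useful baseline under this setup, since its accuracy equals the fraction of merge-ready snapshots among those that build.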
Problem

Research questions and friction points this paper is trying to address.

Mathlib
Pull Request Review
Formal Mathematical Libraries
LLM-assisted Reasoning
Merge-Readiness
Innovation

Methods, ideas, or system contributions that make the work stand out.

MathlibPR
formal mathematical libraries
pull request review
LLM evaluation benchmark
merge-readiness