Benchmarking Testing in Automated Theorem Proving

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of formal theorems generated by large language models lack effective mechanisms for assessing semantic correctness; they often rely on lexical matching or manual inspection, which inadequately captures a theorem’s practical utility. This work proposes the T framework, which introduces the concept of integration testing to formal theorem evaluation for the first time. By automatically extracting theorem dependencies from real-world Lean 4 repositories, the framework constructs the first large-scale benchmark for semantic correctness that requires no human annotation; a generated theorem is judged by whether the downstream theorems that depend on it still compile. Experiments on a benchmark comprising 2,206 problems reveal that even the best-performing model achieves only 38.9% Testing Accuracy, substantially lower than conventional compilation success rates, highlighting significant deficiencies in current models’ capacity for meaningful theorem generation.

📝 Abstract

Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proofs, or on expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T, a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems, each paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set when given both the natural-language proof and the successor theorems as context, revealing a critical gap in current theorem generation capabilities.

Problem

Research questions and friction points this paper is trying to address.

automated theorem proving

semantic correctness

evaluation benchmark

formal verification

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-based evaluation

semantic correctness

automated theorem proving