🤖 AI Summary
This work evaluates large language models (LLMs) as mathematical assistants for completing critical subgoals in formalized proofs of machine learning (ML) theory—a task termed “subgoal completion.” Method: We introduce FormalML, the first formal benchmark for foundational ML theory, comprising 4,937 challenging problems spanning optimization, probabilistic inequalities, and related domains. We propose a declarative subgoal completion framework that combines premise retrieval with research-level context, and we systematically translate procedural Lean 4 proofs into declarative form. Contribution/Results: Experiments show that state-of-the-art theorem-proving LLMs achieve low accuracy and poor inference efficiency on this task, exposing a critical practical bottleneck in deploying LLMs for higher-order mathematical assistance in ML theory formalization.
📝 Abstract
Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4,937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion.
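To make the task concrete, here is a minimal, hypothetical sketch (not drawn from FormalML) of what subgoal completion looks like in a declarative-style Lean 4 proof: the human author states intermediate `have` steps, and the model must discharge one of them.

```lean
import Mathlib

-- Hypothetical example, assuming Mathlib: a declarative proof sketch of
-- the AM–GM-style inequality 2ab ≤ a² + b². The human provides the
-- outline; the boxed `have` is the short obligation left to the prover.
example (a b : ℝ) (h : 0 ≤ (a - b) ^ 2) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  -- stated step: algebraic identity, closed by `ring`
  have key : a ^ 2 - 2 * a * b + b ^ 2 = (a - b) ^ 2 := by ring
  -- subgoal completion target: in the benchmark setting, this proof
  -- body would be omitted and the LLM asked to supply it
  have h2 : 0 ≤ a ^ 2 - 2 * a * b + b ^ 2 := by
    rw [key]; exact h
  linarith
```

The declarative form matters here: because each `have` states its goal explicitly, a subgoal can be excised and posed as a self-contained problem, whereas a purely procedural tactic chain leaves intermediate goals implicit.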