🤖 AI Summary
This work aims to bridge the gap between the empirical success of large language models in formal theorem proving and their theoretical worst-case complexity, focusing on the critical role of cut (lemma) structures in proof efficiency. We model interactive theorem proving as a policy learning problem within a finite-horizon deterministic Markov decision process, introducing a distribution over proof DAGs with latent variables to capture reusable subgoal structure. By combining top-k search with an analysis under the Tsybakov noise condition, we characterize the probability of successful proof synthesis. Our key theoretical contribution establishes, for the first time, that hierarchical learners that preserve cut structures achieve exponential sample-efficiency gains over flat, cut-free learners, thereby providing a rigorous theoretical foundation for subgoal decomposition in automated reasoning.
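The top-k search protocol described above can be illustrated with a minimal sketch. This is a toy goal-set environment, not Lean's actual tactic API: the `TACTICS` table, `policy_scores` (a uniform-score stand-in for an LLM policy), and the state encoding are all hypothetical. At each step the search expands only the k highest-scoring applicable tactics, up to a fixed horizon.

```python
import heapq

# Toy deterministic proof environment (illustrative; not Lean's API).
# A state is a frozenset of open goals; each tactic maps a goal it can
# close to the list of subgoals it opens (None = not applicable).
TACTICS = {
    "intro":  lambda g: [] if g == "A -> A" else None,
    "split":  lambda g: ["A", "B"] if g == "A /\\ B" else None,
    "assume": lambda g: [] if g in ("A", "B") else None,
}

def policy_scores(state):
    # Stand-in for a learned stochastic policy: uniform scores over
    # all applicable (tactic, goal) pairs.
    cands = []
    for goal in state:
        for name, rule in TACTICS.items():
            sub = rule(goal)
            if sub is not None:
                cands.append((1.0, goal, sub))
    return cands

def topk_search(state, k=2, horizon=4):
    """Expand only the k highest-scoring tactic candidates per step."""
    if not state:
        return True                      # all goals closed: proof found
    if horizon == 0:
        return False                     # ran out of budget
    cands = heapq.nlargest(k, policy_scores(state), key=lambda c: c[0])
    for _, goal, sub in cands:
        nxt = (state - {goal}) | set(sub)
        if topk_search(nxt, k, horizon - 1):
            return True
    return False

print(topk_search(frozenset({"A /\\ B"})))  # → True
```

The finite horizon mirrors the finite-horizon MDP in the model: success probability depends jointly on the search budget (k, horizon) and on how well the policy's scores rank the correct tactic into the top k.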
📝 Abstract
We develop a theoretical analysis of LLM-guided formal theorem proving in interactive proof assistants (e.g., Lean) by modeling tactic proposal as a stochastic policy in a finite-horizon deterministic MDP. To capture modern representation learning, we treat the state and action spaces as general compact metric spaces and assume Lipschitz policies. To explain the gap between worst-case hardness and empirical success, we introduce problem distributions generated by a reference policy $q$, including a latent-variable model in which proofs exhibit reusable cut/lemma/sketch structure represented by a proof DAG. Under a top-$k$ search protocol and Tsybakov-type margin conditions, we derive lower bounds on finite-horizon success probability that decompose into search and learning terms, with learning controlled by sequential Rademacher/covering complexity. Our main separation result shows that when cut elimination expands a DAG of depth $D$ into a cut-free tree of size $\Omega(\Lambda^D)$ while the cut-aware hierarchical process has size $O(\lambda^D)$ with $\lambda\ll\Lambda$, a flat (cut-free) learner provably requires exponentially more data than a cut-aware hierarchical learner. This provides a principled justification for subgoal decomposition in recent agentic theorem provers.
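The size separation at the heart of the main result can be made concrete with a small counting sketch, under simplifying assumptions not taken from the paper: a layered proof DAG in which each level contributes a single reusable lemma referenced $\Lambda$ times by the level above. The cut-aware representation stores each lemma once, while cut elimination inlines every reference, yielding geometric growth in $D$.

```python
def dag_size(depth):
    # Cut-aware DAG: one shared lemma node per level, stored once.
    return depth + 1

def tree_size(depth, branching):
    # Cut-free tree: cut elimination inlines every lemma reference,
    # so each node expands into `branching` independent copies below.
    if depth == 0:
        return 1
    return 1 + branching * tree_size(depth - 1, branching)

D, Lam = 8, 3
print(dag_size(D))        # 9 nodes: linear in depth
print(tree_size(D, Lam))  # 9841 nodes: (3^9 - 1) / 2, exponential in depth
```

With hypothesis-class complexity scaling in proof size, this gap is what drives the sample-complexity separation: a flat learner must fit an object of size $\Omega(\Lambda^D)$, a cut-aware learner one of size polynomial (here linear) in $D$.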