🤖 AI Summary
This work aims to bridge the gap between the empirical success of large language models in formal theorem proving and their theoretical worst-case complexity, focusing on the critical role of cut (lemma) structures in proof efficiency. We model interactive theorem proving as a policy learning problem within a finite-horizon deterministic Markov decision process, introducing a distribution over proof DAGs with latent variables to capture reusable subgoal structure. By combining top-k search with an analysis under the Tsybakov noise condition, we characterize the probability of successful proof synthesis. Our key theoretical contribution establishes, for the first time, that hierarchical learners that preserve cut structures achieve exponential sample-efficiency gains over flat, cut-free learners, thereby providing a rigorous theoretical foundation for subgoal decomposition in automated reasoning.
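The top-k search protocol described above can be illustrated with a minimal sketch. This is a toy goal-set environment, not Lean's actual tactic API: the `TACTICS` table, `policy_scores` (a uniform-score stand-in for an LLM policy), and the state encoding are all hypothetical. At each step the search expands only the k highest-scoring applicable tactics, up to a fixed horizon.

```python
import heapq

# Toy deterministic proof environment (illustrative; not Lean's API).
# A state is a frozenset of open goals; each tactic maps a goal it can
# close to the list of subgoals it opens (None = not applicable).
TACTICS = {
    "intro":  lambda g: [] if g == "A -> A" else None,
    "split":  lambda g: ["A", "B"] if g == "A /\\ B" else None,
    "assume": lambda g: [] if g in ("A", "B") else None,
}

def policy_scores(state):
    # Stand-in for a learned stochastic policy: uniform scores over
    # all applicable (tactic, goal) pairs.
    cands = []
    for goal in state:
        for name, rule in TACTICS.items():
            sub = rule(goal)
            if sub is not None:
                cands.append((1.0, goal, sub))
    return cands

def topk_search(state, k=2, horizon=4):
    """Expand only the k highest-scoring tactic candidates per step."""
    if not state:
        return True                      # all goals closed: proof found
    if horizon == 0:
        return False                     # ran out of budget
    cands = heapq.nlargest(k, policy_scores(state), key=lambda c: c[0])
    for _, goal, sub in cands:
        nxt = (state - {goal}) | set(sub)
        if topk_search(nxt, k, horizon - 1):
            return True
    return False

print(topk_search(frozenset({"A /\\ B"})))  # → True
```

The finite horizon mirrors the finite-horizon MDP in the model: success probability depends jointly on the search budget (k, horizon) and on how well the policy's scores rank the correct tactic into the top k.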
📝 Abstract
We develop a theoretical analysis of LLM-guided formal theorem proving in interactive proof assistants (e.g., Lean) by modeling tactic proposal as a stochastic policy in a finite-horizon deterministic MDP. To capture modern representation learning, we treat the state and action spaces as general compact metric spaces and assume Lipschitz policies. To explain the gap between worst-case hardness and empirical success, we introduce problem distributions generated by a reference policy $q$, including a latent-variable model in which proofs exhibit reusable cut/lemma/sketch structure represented by a proof DAG. Under a top-$k$ search protocol and Tsybakov-type margin conditions, we derive lower bounds on finite-horizon success probability that decompose into search and learning terms, with learning controlled by sequential Rademacher/covering complexity. Our main separation result shows that when cut elimination expands a DAG of depth $D$ into a cut-free tree of size $\Omega(\Lambda^D)$ while the cut-aware hierarchical process has size $O(\lambda^D)$ with $\lambda\ll\Lambda$, a flat (cut-free) learner provably requires exponentially more data than a cut-aware hierarchical learner. This provides a principled justification for subgoal decomposition in recent agentic theorem provers.
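The size separation at the heart of the main result can be made concrete with a small counting sketch, under simplifying assumptions not taken from the paper: a layered proof DAG in which each level contributes a single reusable lemma referenced $\Lambda$ times by the level above. The cut-aware representation stores each lemma once, while cut elimination inlines every reference, yielding geometric growth in $D$.

```python
def dag_size(depth):
    # Cut-aware DAG: one shared lemma node per level, stored once.
    return depth + 1

def tree_size(depth, branching):
    # Cut-free tree: cut elimination inlines every lemma reference,
    # so each node expands into `branching` independent copies below.
    if depth == 0:
        return 1
    return 1 + branching * tree_size(depth - 1, branching)

D, Lam = 8, 3
print(dag_size(D))        # 9 nodes: linear in depth
print(tree_size(D, Lam))  # 9841 nodes: (3^9 - 1) / 2, exponential in depth
```

With hypothesis-class complexity scaling in proof size, this gap is what drives the sample-complexity separation: a flat learner must fit an object of size $\Omega(\Lambda^D)$, a cut-aware learner one of size polynomial (here linear) in $D$.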