🤖 AI Summary
This work seeks to explain why agentic theorem-proving systems succeed empirically despite the classical hardness of proof search, and which of their components drive performance. To this end, we model the proving process as a time-bounded Markov decision process (MDP) and introduce the notion of “statistical provability”: the probability, averaged over a distribution of problem instances, that a system reaches a verified proof within a bounded number of steps. Leveraging the Bellman equation, we establish the existence of an optimal policy under mild regularity conditions, derive provability certificates from sub- and super-solution inequalities, and bound the performance gap of score-guided planning methods in terms of approximation error, sequential statistical complexity, the metric entropy and doubling structure of the representation, and tail bounds on action-gap margins. The resulting theory explains why such provers are effective on realistic, biased problem distributions and clarifies their fundamental limitations in worst-case or adversarial scenarios.
📝 Abstract
Agentic theorem provers -- pipelines that couple a mathematical reasoning model with library retrieval, a subgoal-decomposition/search planner, and a proof-assistant verifier -- have recently achieved striking empirical success, yet it remains unclear which components drive performance and why such systems work at all despite the classical hardness of proof search. We propose a distributional viewpoint: we formalize modern theorem-proving pipelines as time-bounded MDPs and introduce **statistical provability**, defined as the finite-horizon probability of reaching a verified proof, averaged over an instance distribution. Exploiting Bellman structure, we prove the existence of optimal policies under mild regularity, derive provability certificates via sub-/super-solution inequalities, and bound the performance gap of score-guided planning (greedy/top-\(k\)/beam/rollouts) in terms of approximation error, sequential statistical complexity, representation geometry (metric entropy/doubling structure), and action-gap margin tails. Together, our theory provides a principled, component-sensitive explanation of when and why agentic theorem provers succeed on biased real-world problem distributions, while clarifying limitations in worst-case or adversarial regimes.
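
To make the definition of statistical provability concrete, the following is a minimal sketch, not the paper's construction. It assumes a tiny tabular MDP per problem instance, with a single absorbing "proved" state, and computes the best achievable finite-horizon success probability by backward induction on the Bellman recursion, then averages over a hypothetical instance distribution. All names (`PROVED`, `success_probability`, `statistical_provability`) and all transition probabilities are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy encoding: transitions[state][action] = [(prob, next_state), ...]
Transitions = Dict[int, Dict[str, List[Tuple[float, int]]]]

PROVED = -1  # absorbing state meaning "the verifier accepted a proof" (assumption)


def success_probability(transitions: Transitions, start: int, horizon: int) -> float:
    """Best probability of reaching PROVED within `horizon` steps (backward induction)."""
    value = {s: 0.0 for s in transitions}
    value[PROVED] = 1.0
    for _ in range(horizon):
        new_value = {PROVED: 1.0}
        for s, actions in transitions.items():
            # Bellman step: pick the action with the highest expected continuation value.
            # A state with no actions is a dead end and keeps value 0.
            new_value[s] = max(
                (sum(p * value[nxt] for p, nxt in outcomes) for outcomes in actions.values()),
                default=0.0,
            )
        value = new_value
    return value[start]


def statistical_provability(
    instances: List[Tuple[float, Transitions, int]], horizon: int
) -> float:
    """Average finite-horizon success probability over weighted instances."""
    return sum(w * success_probability(t, s0, horizon) for w, t, s0 in instances)


# Two made-up instances: an "easy" one that can be retried, a "hard" one with a dead end.
easy: Transitions = {0: {"apply_lemma": [(0.9, PROVED), (0.1, 0)]}}
hard: Transitions = {0: {"search": [(0.1, PROVED), (0.9, 1)]}, 1: {}}
instances = [(0.8, easy, 0), (0.2, hard, 0)]

print(statistical_provability(instances, horizon=2))
```

With these made-up numbers and a horizon of 2, the easy instance succeeds with probability 0.99 and the hard one with probability 0.1, so the weighted average is about 0.812. The toy distribution puts most of its mass on the easy instance, which is why the average stays high even though the hard instance is almost never proved; this mirrors, in miniature, the distributional viewpoint described above.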