Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

2026-03-18
AI Summary
Automatically generating machine-checkable proofs remains a significant challenge for large language models. This work proposes a hierarchical proof search framework for code verification in Lean 4 that recursively decomposes complex proof goals into simpler subgoals, unifying decomposition and completion in a single automated pipeline. The authors introduce a decomposition score that jointly evaluates constructive justification and structural effectiveness; the same score serves as the training reward and as the ranking criterion at inference, aligning optimization with deployment. Starting from supervised initialization and continuing with hybrid reinforcement learning, they train an 8B-parameter unified policy model in which a continuous decomposition reward guides exploration and supervised replay stabilizes proof generation. Evaluated on 427 tasks across three Lean-based benchmarks, the approach achieves a 62.0% success rate, a 2.6× improvement over the strongest baseline, and surpasses neural provers up to 84× larger. Success rates also improve monotonically with search iterations and sampling budget.

Abstract
Large language models (LLMs) can generate plausible code but offer limited guarantees of correctness. Formally verifying that implementations satisfy specifications requires constructing machine-checkable proofs, a task that remains beyond current automation. We propose a hierarchical proof search framework for automated code verification in Lean 4 that decomposes complex verification goals into structurally simpler subgoals before attempting tactic-level proving. Central to our approach is a principled decomposition score that combines constructive justification with structural effectiveness. Crucially, this score serves as both the training reward and the inference-time ranking criterion, ensuring strict alignment between optimization and deployment. We train Goedel-Code-Prover-8B, a single unified policy for both decomposition and completion, via supervised initialization followed by hybrid reinforcement learning, where a continuous decomposition reward drives planning exploration while supervised replay stabilizes proof generation. On three Lean-based code verification benchmarks comprising 427 tasks, our 8B-parameter model achieves a 62.0% prove success rate, a 2.6× improvement over the strongest baseline, surpassing neural provers up to 84× larger. We further observe consistent inference-time scaling: success rates improve monotonically with search iterations and sampling budget, with our trained model achieving greater efficiency than frontier off-the-shelf models of comparable scale.
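The search loop described in the abstract can be illustrated with a toy sketch in Python. Everything here is an assumption made for illustration and is not taken from the paper: the `Decomposition` type, the formula inside `decomposition_score` (zero for an unjustified decomposition, otherwise the relative shrinkage of the hardest subgoal), and the string-based "goals" that stand in for Lean 4 proof states. The only idea carried over from the abstract is that candidate decompositions are ranked by a single score that could also be used as a training reward.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Decomposition:
    """Hypothetical container for one candidate decomposition of a goal."""
    subgoals: Tuple[str, ...]
    justified: bool  # True if the subgoals were checked to imply the parent goal


def decomposition_score(parent: str, d: Decomposition,
                        size: Callable[[str], int]) -> float:
    """Assumed score in [0, 1]; the paper's actual formula is not reproduced here."""
    if not d.justified or not d.subgoals:
        return 0.0  # no constructive justification, so no credit
    hardest = max(size(g) for g in d.subgoals)
    return max(0.0, 1.0 - hardest / size(parent))  # reward structural simplification


def prove(goal: str,
          propose: Callable[[str], List[Decomposition]],
          complete: Callable[[str], bool],
          size: Callable[[str], int],
          depth: int = 0,
          max_depth: int = 4) -> bool:
    """Close the goal directly if possible, otherwise recurse on ranked decompositions."""
    if complete(goal):
        return True
    if depth >= max_depth:
        return False
    scored = [(decomposition_score(goal, d, size), d) for d in propose(goal)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # same score ranks candidates
    for score, d in scored:
        if score <= 0.0:
            break  # remaining candidates are unjustified or not simpler
        if all(prove(g, propose, complete, size, depth + 1, max_depth)
               for g in d.subgoals):
            return True
    return False


# Toy domain: a goal is a conjunction written as "a and b and c".
def conjuncts(goal: str) -> List[str]:
    return goal.split(" and ")


def size(goal: str) -> int:
    return len(conjuncts(goal))


def complete(goal: str) -> bool:
    return size(goal) == 1  # pretend a tactic closes any atomic goal


def propose(goal: str) -> List[Decomposition]:
    parts = conjuncts(goal)
    mid = len(parts) // 2
    halves = (" and ".join(parts[:mid]), " and ".join(parts[mid:]))
    return [
        Decomposition(subgoals=(goal,), justified=True),        # restatement, score 0
        Decomposition(subgoals=halves, justified=True),         # genuine split
        Decomposition(subgoals=(parts[0],), justified=False),   # drops conjuncts
    ]


if __name__ == "__main__":
    print(prove("a and b and c and d", propose, complete, size))
```

In this toy setting the restatement and the unjustified candidate both score zero, so only the genuine split is explored. A real system would replace `propose` and `complete` with model calls checked by the Lean 4 kernel.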
Problem

Research questions and friction points this paper is trying to address.

code verification
formal proof
automated reasoning
machine-checkable proofs
program correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical proof search
code verification
decomposition score
reinforcement learning
Lean 4