🤖 AI Summary
This work addresses the poor generalization and exposure bias of large language models (LLMs) in formal theorem proving. Methodologically, it proposes a reinforcement learning framework in which the LLM serves as a policy network, optimized via a PPO variant within formal environments such as Lean, without relying on direct supervised fine-tuning. The approach rolls out candidate tactics from the policy and compares them against the expected ones, jointly orchestrating policy sampling, reward modeling, and multi-step tactic generation, thereby reducing the dependence on purely imitative training over expert proof trajectories. Its core contribution is introducing this explicit comparison between policy rollouts and expected tactics into LLM-based theorem-proving training, which mitigates exposure bias. Experiments show higher proof accuracy than a directly fine-tuned LLM baseline, suggesting that reinforcement-driven policy search improves reasoning robustness and generalization.
📝 Abstract
To take advantage of Large Language Models (LLMs) in theorem formalization and proving, we propose a reinforcement learning framework that iteratively optimizes a pretrained LLM by rolling out next tactics and comparing them with the expected ones. Experimental results show that this approach achieves higher accuracy than a directly fine-tuned LLM.
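To make the rollout-and-compare idea concrete, here is a minimal toy sketch of such a loop. Everything in it is an illustrative assumption, not the paper's implementation: the tactic vocabulary, the binary match reward, and a plain REINFORCE-style update standing in for the PPO variant; a table of per-step logits stands in for the LLM policy.

```python
# Toy sketch of a rollout-vs-expected-tactic RL loop (illustrative assumptions
# throughout: tactic vocabulary, binary reward, REINFORCE instead of PPO,
# a logit table instead of an LLM policy network).
import math
import random

random.seed(0)

TACTICS = ["intro", "apply", "rewrite", "simp", "exact"]  # toy tactic vocabulary

# Stand-in policy: one logit vector per proof step.
logits = [[0.0] * len(TACTICS) for _ in range(3)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

expected = ["intro", "apply", "exact"]  # expected tactic sequence for one goal
lr = 0.5

for epoch in range(200):
    for step, gold in enumerate(expected):
        probs = softmax(logits[step])
        a = sample(probs)                 # roll out a tactic from the policy
        reward = 1.0 if TACTICS[a] == gold else 0.0  # compare with expected
        # REINFORCE update: grad of log pi(a) w.r.t. logits is one_hot(a) - probs.
        for i in range(len(TACTICS)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[step][i] += lr * reward * grad

learned = [TACTICS[max(range(len(TACTICS)), key=lambda i: logits[s][i])]
           for s in range(len(expected))]
print(learned)
```

Because the reward is nonzero only when the rolled-out tactic matches the expected one, each such match pushes probability mass toward the expected tactic, and the greedy policy converges to the expected sequence.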