🤖 AI Summary
Project-level training data for proof-oriented programming (e.g., in F*) is scarce: high-quality corpora are small, and existing data does not capture the complex, multi-step reasoning that project-level proofs require. Method: We propose the first synthetic data augmentation framework tailored to project-level proof generation and repair. It integrates language-specific pretraining, multi-source reasoning elicitation, and modeling of proof evolution within existing repositories, establishing a dual-path “generation + repair” paradigm; the pipeline spans synthetic problem generation, cross-language reasoning knowledge distillation, instruction fine-tuning, and reinforcement learning-based output refinement. Contribution/Results: The resulting 14B-parameter model, PoPilot, outperforms GPT-4o on project-level proof tasks by a 64% relative margin, and using it to repair GPT-4o's outputs yields a further 54% gain over GPT-4o's self-repair. This significantly strengthens large language models' capabilities in formal reasoning and code verification.
📝 Abstract
Existing LMs struggle with proof-oriented programming due to data scarcity, which manifests in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach models the intricate reasoning involved in proof-oriented programming. We present the first study of synthetic data augmentation for project-level proof-oriented programming, covering both proof generation and proof repair. Our method addresses data scarcity by (i) synthesizing basic proof-oriented programming problems to build proficiency in the language, (ii) incorporating diverse coding data to elicit reasoning capability, and (iii) creating new proof and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs at the function and repository level. We show that our fine-tuned 14B-parameter model, PoPilot, outperforms GPT-4o in project-level proof-oriented programming by a 64% relative margin, and improves GPT-4o's performance by 54% by repairing its outputs, compared against GPT-4o's self-repair.
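To ground the term for readers unfamiliar with F*: in proof-oriented programming, the specification is embedded in a function's type, and the compiler (backed by an SMT solver) must prove the implementation satisfies it. A minimal illustrative sketch, not taken from the paper:

```fstar
(* The refinement on the result type states the specification:
   the returned value is at least as large as both inputs.
   F* discharges the proof obligation automatically via Z3. *)
val max2 : x:int -> y:int -> r:int{r >= x /\ r >= y}
let max2 x y = if x >= y then x else y
```

Project-level tasks, the paper's focus, go well beyond such single-function proofs: they require reasoning over definitions, lemmas, and proof conventions spread across an existing repository.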