🤖 AI Summary
Project-level training data for proof-oriented programming (e.g., in F*) is scarce: high-quality corpora are small, and existing data does not capture the complex, multi-step reasoning that project-level proofs require. Method: We propose the first synthetic data augmentation framework tailored to project-level proof generation and repair. It integrates language-specific pretraining, multi-source reasoning elicitation, and modeling of proof evolution within existing repositories, establishing a dual-path “generation + repair” paradigm; the pipeline spans synthetic problem generation, cross-language reasoning knowledge distillation, instruction fine-tuning, and reinforcement learning-based output refinement. Contribution/Results: The resulting 14B-parameter model, PoPilot, outperforms GPT-4o on project-level proof tasks by a 64% relative margin, and using it to repair GPT-4o's outputs yields a further 54% gain over GPT-4o's self-repair. This significantly strengthens large language models' capabilities in formal reasoning and code verification.
📝 Abstract
Existing LMs struggle with proof-oriented programming due to data scarcity, which manifests in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach models the intricate reasoning involved in proof-oriented programming. We present the first study of synthetic data augmentation for project-level proof-oriented programming, covering both proof generation and proof repair. Our method addresses data scarcity by (i) synthesizing basic proof-oriented programming problems to build proficiency in the language, (ii) incorporating diverse coding data to elicit reasoning capability, and (iii) creating new proof and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs at the function and repository level. We show that our fine-tuned 14B-parameter model, PoPilot, outperforms GPT-4o in project-level proof-oriented programming by a 64% relative margin, and improves GPT-4o's performance by 54% by repairing its outputs, compared against GPT-4o's self-repair.
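To ground the term for readers unfamiliar with F*: in proof-oriented programming, the specification is embedded in a function's type, and the compiler (backed by an SMT solver) must prove the implementation satisfies it. A minimal illustrative sketch, not taken from the paper:

```fstar
(* The refinement on the result type states the specification:
   the returned value is at least as large as both inputs.
   F* discharges the proof obligation automatically via Z3. *)
val max2 : x:int -> y:int -> r:int{r >= x /\ r >= y}
let max2 x y = if x >= y then x else y
```

Project-level tasks, the paper's focus, go well beyond such single-function proofs: they require reasoning over definitions, lemmas, and proof conventions spread across an existing repository.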