🤖 AI Summary
Large language models (LLMs) exhibit poor generalization and limited robustness in formal theorem proving. To address this, we propose the first data augmentation framework grounded in symmetry modeling and controllable difficulty evolution. Our method characterizes theorem symmetry from two perspectives: syntactic symmetry, via abstract syntax tree (AST) transformations, and semantic symmetry, via LLM-driven cross-domain transfer. We further introduce an instruction-guided difficulty evolution mechanism that systematically generates high-quality, diverse theorem variants. This approach significantly improves training data quality and model robustness. Using this framework, we train EvolProver, a 7B-parameter non-reasoning LLM, which achieves a state-of-the-art 53.8% pass@32 rate on FormalMATH-Lite and outperforms existing reasoning and non-reasoning models across multiple formal theorem proving benchmarks.
📝 Abstract
Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To address this limitation, we introduce a novel data augmentation pipeline designed to enhance model robustness from two perspectives: symmetry and difficulty. From the symmetry perspective, we propose two complementary methods: EvolAST, an Abstract Syntax Tree (AST) based approach that targets syntactic symmetry to generate semantically equivalent problem variants, and EvolDomain, which leverages LLMs to address semantic symmetry by translating theorems across mathematical domains. From the difficulty perspective, we propose EvolDifficulty, which uses carefully designed evolutionary instructions to guide LLMs in generating new theorems with a wider range of difficulty. We then use the evolved data to train EvolProver, a 7B-parameter non-reasoning theorem prover. EvolProver establishes a new state-of-the-art (SOTA) on FormalMATH-Lite with a 53.8% pass@32 rate, surpassing all models of comparable size, including reasoning-based models. It also sets new SOTA records for non-reasoning models on MiniF2F-Test (69.8% pass@32), Ineq-Comp-Seed (52.2% pass@32), and Ineq-Comp-Transformed (34.0% pass@32). Ablation studies further confirm our data augmentation pipeline's effectiveness across multiple benchmarks.
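To make the idea of syntactic symmetry concrete: an AST-based rewrite can produce a statement that looks different on the surface but is provably equivalent, for example by exploiting commutativity. The paper's EvolAST operates on formal theorem statements (e.g., in Lean); the sketch below is only a toy Python analogue on arithmetic expressions, and the `CommuteSwap` / `symmetric_variant` names are our own illustration, not the authors' API.

```python
import ast

class CommuteSwap(ast.NodeTransformer):
    """Swap the operands of commutative binary operators (+ and *)
    to yield a syntactically different, semantically equivalent
    expression -- a toy instance of an AST symmetry transformation."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite subexpressions first
        if isinstance(node.op, (ast.Add, ast.Mult)):
            node.left, node.right = node.right, node.left
        return node

def symmetric_variant(expr: str) -> str:
    """Parse an expression, apply the symmetry rewrite, and unparse."""
    tree = ast.parse(expr, mode="eval")
    return ast.unparse(CommuteSwap().visit(tree))

print(symmetric_variant("a + b * c"))  # → "c * b + a"
```

A real pipeline in this spirit would apply such rewrites to theorem ASTs and keep only variants whose equivalence is guaranteed by the transformation rules, so each new training example comes with a known-correct relationship to the original.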