Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

📅 2025-11-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The scarcity of high-quality, diverse mathematical data hampers the performance of medium-scale LLMs in formal theorem proving. Method: This paper proposes a progressive training framework tailored for theorem proving: (1) constructing a large-scale, multi-domain mathematical corpus and formalization pipeline; (2) designing a chain-of-thought (CoT)-enhanced state prediction task, integrated with continual pretraining, supervised fine-tuning (SFT), group relative policy optimization (GRPO), and expert iteration; and (3) executing a three-stage collaborative training regimen to enhance reasoning capabilities. Results: The method achieves a 37.0% average pass rate (pass@32) on ExamFormal-Bench, solves 27 problems on PutnamBench, and attains 24.0% pass@32 on CombiBench, substantially outperforming existing open-source models of similar size. These results validate the effectiveness and scalability of the data construction strategy and training paradigm for formal reasoning.
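Concretely, the CoT-augmented state prediction task trains the model to reason about intermediate Lean proof states before emitting the next tactic. The toy Lean 4 theorem below is our own illustration of the (state, tactic) structure such a task targets, not an example from the paper's corpus.

```lean
-- Illustrative toy example (not from the paper's training data).
-- A state-prediction instance would show the theorem and a partial
-- proof, then ask the model to predict the current goal state, with
-- a chain-of-thought explanation, before producing the next tactic.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- Goal state at this point: ⊢ a + b = b + a
  exact Nat.add_comm a b
```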

📝 Abstract
Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B-parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible, moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. A key innovation is a "CoT-augmented state prediction" task that achieves fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this combination of diverse training data and a progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.
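The expert iteration loop described in the abstract follows a familiar pattern: sample candidate proofs, keep only those the verifier accepts, and fine-tune on the survivors. The skeleton below is a hypothetical sketch; every function in it is a placeholder stub defined for illustration, not the paper's pipeline or a real API.

```python
import random

def sample_proofs(model, problem: str, n: int) -> list[str]:
    """Placeholder: sample n candidate proofs from the model."""
    return [f"proof_{i}" for i in range(n)]

def lean_check(problem: str, proof: str) -> bool:
    """Placeholder: call a Lean proof checker on one candidate."""
    return random.random() < 0.1  # stand-in success rate

def supervised_finetune(model, dataset):
    """Placeholder: one SFT round on verified (problem, proof) pairs."""
    return model

def expert_iteration(model, problems: list[str], rounds: int = 3, k: int = 32):
    dataset: list[tuple[str, str]] = []
    for _ in range(rounds):
        for p in problems:
            verified = [c for c in sample_proofs(model, p, k) if lean_check(p, c)]
            dataset.extend((p, c) for c in verified)  # keep only checked proofs
        model = supervised_finetune(model, dataset)   # retrain on the growing set
    return model
```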
Problem

Research questions and friction points this paper is trying to address.

Addressing limited formal reasoning in LLMs due to scarce training data
Enhancing theorem proving capabilities in moderately-sized language models
Improving automated formal reasoning through diverse mathematical data training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage training framework for lightweight LLMs
CoT-augmented state prediction for fine-grained reasoning
Group Relative Policy Optimization for challenging problems
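To make the GRPO step concrete, below is a minimal sketch of the standard group-relative advantage computation over a batch of proofs sampled for one problem. The function name, binary reward encoding, and sample values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages for one problem's sampled proofs.

    rewards has shape (G,): e.g. 1.0 when the Lean checker accepts a
    sampled proof, 0.0 otherwise. Standard GRPO normalization: subtract
    the group mean, then divide by the group standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 proofs sampled for one statement, two of them verified.
adv = grpo_advantages(np.array([1, 0, 0, 1, 0, 0, 0, 0], dtype=float))
print(adv)  # verified proofs get positive advantage, failures negative
```

Because advantages are normalized within each group of samples for the same problem, GRPO needs no separate value network, which keeps the RL stage lightweight for a 7B model.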
Xinyuan Zhou
iFlytek Research
Yi Lei
iFlytek Research
Xiaoyu Zhou
Peking University
Computer Vision · Autonomous Driving · AI Security
Jingyi Sun
University of Copenhagen
Explainability · Interpretability · NLP
Yu Zhu
iFlytek Research
Zhongyi Ye
iFlytek Research
Weitai Zhang
iFlytek Research
Quan Liu
iFlytek Research
Si Wei
iFlytek Research
Cong Liu
iFlytek Research