TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow

📅 2026-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of open-source formalized mathematical data, which stems from the high cost of agent-based workflows. To overcome this limitation, the authors propose a low-cost, scalable data synthesis framework that decomposes formalization into five sub-tasks: statement formalization, proof generation, premise selection, proof correction, and proof sketching. A decoupled extraction strategy recovers valuable training signals from failed trajectories, increasing proof-generation data yield by 1.6× compared to standard filtering and substantially improving data utilization. Using a staged agent workflow built on the Gemini-3-Flash model, the approach achieves a 12.6% verified rate on a 2,000-problem benchmark, outperforming the 8.6% baseline, while reducing the average cost per successful trajectory to just $0.481.

📝 Abstract
The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce TheoremForge, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks: statement formalization, proof generation, premise selection, proof correction, and proof sketching. By implementing a Decoupled Extraction Strategy, the workflow recovers valid training signals from globally failed trajectories, effectively utilizing otherwise wasted computation. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6%, surpassing the 8.6% baseline, at an average cost of only $0.481 per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases data yield by 1.6× for proof generation compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available at https://github.com/timechess/TheoremForge.
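To make the Decoupled Extraction Strategy concrete, here is a minimal, hypothetical sketch of the idea as described in the abstract: even when a formalization trajectory fails end-to-end, sub-task steps that individually pass verification are kept as training examples instead of being discarded by whole-trajectory filtering. All function names, field names, and data shapes below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of decoupled extraction: salvage verified sub-task
# outputs from a trajectory that failed as a whole. Field names such as
# "subtask" and "verified" are illustrative, not from the paper.

def extract_training_signals(trajectory):
    """Return (subtask, input, output) triples for every step that was
    independently verified, regardless of overall trajectory success."""
    return [
        (step["subtask"], step["input"], step["output"])
        for step in trajectory
        if step["verified"]  # e.g. the proof checker accepted this step's output
    ]

# A failed trajectory: proof generation failed, but statement
# formalization and premise selection succeeded on their own.
trajectory = [
    {"subtask": "statement_formalization", "input": "nl statement",
     "output": "formal statement", "verified": True},
    {"subtask": "premise_selection", "input": "formal statement",
     "output": ["lemma_a", "lemma_b"], "verified": True},
    {"subtask": "proof_generation", "input": "formal statement",
     "output": "sorry", "verified": False},
]

signals = extract_training_signals(trajectory)
# Standard filtering would discard this whole trajectory; decoupled
# extraction recovers the two verified sub-task examples.
```

Under this reading, the reported 1.6× gain in proof-generation data yield comes from harvesting such partially successful trajectories rather than filtering on end-to-end success alone.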
Problem

Research questions and friction points this paper is trying to address.

formal mathematics
data synthesis
agentic workflow
open-source corpora
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

formal data synthesis
agentic workflow
decoupled extraction strategy
theorem proving
cost-efficient AI
Yicheng Tao
Carnegie Mellon University
Natural Language Processing · Smart Cities
Hongteng Xu
Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE