LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

📅 2026-03-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work aims to enhance native formal reasoning capabilities in Lean 4, focusing on three core tasks: autoformalization, proof sketch generation, and complete proof construction. To this end, the authors propose an agentic Tool-Integrated Reasoning (TIR) framework, equipped with a hybrid-experts iteration mechanism for expanding high-quality training trajectories and a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm. By combining gradient masking with theorem consistency and legality checks, the approach stabilizes long-horizon training of a 560-billion-parameter MoE model and mitigates reward hacking. Experimental results show that the method achieves a 97.1% pass rate on MiniF2F-Test with only 72 inference attempts per problem, and solves 70.8% of ProverBench and 41.5% of PutnamBench, substantially outperforming existing open-source models.
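To make the three tasks concrete, here is a toy illustration (not drawn from the paper; assumes Lean 4 with Mathlib): the informal claim "the sum of two even numbers is even" autoformalized into a statement and then closed with a complete proof.

```lean
import Mathlib

-- Autoformalization: the informal claim "the sum of two even numbers is even"
-- rendered as a Lean 4 statement, then proved in full (whole-proof construction).
-- A lemma-style sketch would instead leave intermediate goals as `sorry` stubs.
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  obtain ⟨x, hx⟩ := ha   -- a = x + x
  obtain ⟨y, hy⟩ := hb   -- b = y + y
  exact ⟨x + y, by omega⟩
```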

📝 Abstract
We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances native formal reasoning in Lean 4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities: auto-formalization, sketching, and proving. To develop these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, covering the generation of a formal statement from a given informal problem, a whole proof directly from the statement, or a lemma-style sketch. For agentic RL, we present the Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which stabilizes MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for policy staleness and inherent train-inference engine discrepancies at both the sequence and token levels. Additionally, we incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking. Extensive evaluations show that LongCat-Flash-Prover sets a new state of the art among open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using an inference budget of only 72 attempts per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
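The page gives no implementation details for HisPO. As a rough illustration of two-level (sequence- and token-level) gradient masking driven by train-inference importance ratios, here is a minimal NumPy sketch; the function name, thresholds, and the exact masking rule are all assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def hispo_style_mask(logp_train, logp_infer, eps_token=0.2, eps_seq=0.1):
    """Illustrative two-level gradient mask (NOT the paper's exact HisPO).

    logp_train: per-token log-probs from the training engine, shape (T,)
    logp_infer: per-token log-probs recorded by the inference engine, shape (T,)

    Tokens whose train/inference importance ratio drifts outside
    [1 - eps_token, 1 + eps_token] are masked out; if the per-token
    geometric-mean sequence ratio drifts outside [1 - eps_seq, 1 + eps_seq],
    the entire sequence's gradient is masked.
    """
    diff = np.asarray(logp_train) - np.asarray(logp_infer)
    ratio = np.exp(diff)                       # token-level importance ratios
    token_mask = (ratio >= 1 - eps_token) & (ratio <= 1 + eps_token)
    seq_ratio = np.exp(diff.mean())            # geometric mean over the sequence
    seq_ok = (1 - eps_seq) <= seq_ratio <= (1 + eps_seq)
    return token_mask & seq_ok, ratio, seq_ratio
```

In a full RL loop, the returned mask would zero out the policy-gradient contribution of stale or engine-inconsistent tokens (and of whole drifted sequences) before the update.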
Problem

Research questions and friction points this paper is trying to address.

Native Formal Reasoning
Auto-formalization
Theorem Proving
Mixture-of-Experts
Long-horizon Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Tool-Integrated Reasoning
Mixture-of-Experts
Hierarchical Importance Sampling Policy Optimization
Native Formal Reasoning
Auto-formalization
Authors
Jianing Wang
Jianfei Zhang
Qi Guo
Linsen Guo
Rumei Li
Chao Zhang
Chong Peng (Qingdao University: Machine Learning, Computer Vision)
Cunguang Wang
Dengchang Zhao
Jiarong Shi
Jingang Wang (Meituan: Information Retrieval, Natural Language Processing, Machine Translation)
Liulin Feng
Mengxia Shen
Qi Li
Shengnan An (Meituan: Model Evaluation, Reasoning, Natural Language Processing, Large Language Models)
Shun Wang
Wei Shi
Xiangyu Xi (Peking University; Meituan Group: Natural Language Processing, Event Extraction, Information Extraction, Task-Oriented Dialogue)
Xiaoyu Li
Xuezhi Cao (Meituan: Data Mining, Knowledge Graphs, LLMs)
Yi Lu
Yunke Zhao
Zhengyu Chen
Zhimin Lin
Wei Wang