LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

📅 2026-03-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work aims to enhance native formal reasoning capabilities in Lean 4, focusing on three core tasks: autoformalization, proof sketch generation, and complete proof construction. To this end, the authors propose an agentic Tool-Integrated Reasoning (TIR) framework, equipped with a hybrid-experts iteration mechanism for expanding high-quality training trajectories and a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm. By combining gradient masking with theorem consistency and legality checks, the approach stabilizes long-horizon training of a 560-billion-parameter MoE model and mitigates reward hacking. Experimental results show that the method achieves a 97.1% pass rate on MiniF2F-Test with only 72 inference attempts per problem, and solves 70.8% of ProverBench and 41.5% of PutnamBench, substantially outperforming existing open-source models.
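To make the three tasks concrete, here is a toy illustration (not drawn from the paper; assumes Lean 4 with Mathlib): the informal claim "the sum of two even numbers is even" autoformalized into a statement and then closed with a complete proof.

```lean
import Mathlib

-- Autoformalization: the informal claim "the sum of two even numbers is even"
-- rendered as a Lean 4 statement, then proved in full (whole-proof construction).
-- A lemma-style sketch would instead leave intermediate goals as `sorry` stubs.
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  obtain ⟨x, hx⟩ := ha   -- a = x + x
  obtain ⟨y, hy⟩ := hb   -- b = y + y
  exact ⟨x + y, by omega⟩
```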

📝 Abstract
We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances native formal reasoning in Lean 4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities: auto-formalization, sketching, and proving. To develop these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, covering the generation of a formal statement from a given informal problem, a whole proof directly from the statement, or a lemma-style sketch. For agentic RL, we present the Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which stabilizes MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for policy staleness and inherent train-inference engine discrepancies at both the sequence and token levels. Additionally, we incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking. Extensive evaluations show that LongCat-Flash-Prover sets a new state of the art among open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using an inference budget of only 72 attempts per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
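The page gives no implementation details for HisPO. As a rough illustration of two-level (sequence- and token-level) gradient masking driven by train-inference importance ratios, here is a minimal NumPy sketch; the function name, thresholds, and the exact masking rule are all assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def hispo_style_mask(logp_train, logp_infer, eps_token=0.2, eps_seq=0.1):
    """Illustrative two-level gradient mask (NOT the paper's exact HisPO).

    logp_train: per-token log-probs from the training engine, shape (T,)
    logp_infer: per-token log-probs recorded by the inference engine, shape (T,)

    Tokens whose train/inference importance ratio drifts outside
    [1 - eps_token, 1 + eps_token] are masked out; if the per-token
    geometric-mean sequence ratio drifts outside [1 - eps_seq, 1 + eps_seq],
    the entire sequence's gradient is masked.
    """
    diff = np.asarray(logp_train) - np.asarray(logp_infer)
    ratio = np.exp(diff)                       # token-level importance ratios
    token_mask = (ratio >= 1 - eps_token) & (ratio <= 1 + eps_token)
    seq_ratio = np.exp(diff.mean())            # geometric mean over the sequence
    seq_ok = (1 - eps_seq) <= seq_ratio <= (1 + eps_seq)
    return token_mask & seq_ok, ratio, seq_ratio
```

In a full RL loop, the returned mask would zero out the policy-gradient contribution of stale or engine-inconsistent tokens (and of whole drifted sequences) before the update.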
Problem

Research questions and friction points this paper is trying to address.

Native Formal Reasoning
Auto-formalization
Theorem Proving
Mixture-of-Experts
Long-horizon Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Tool-Integrated Reasoning
Mixture-of-Experts
Hierarchical Importance Sampling Policy Optimization
Native Formal Reasoning
Auto-formalization
Authors
Jianing Wang
Jianfei Zhang
Qi Guo
Linsen Guo
Rumei Li
Chao Zhang
Chong Peng (Qingdao University: Machine Learning, Computer Vision)
Cunguang Wang
Dengchang Zhao
Jiarong Shi
Jingang Wang (Meituan: Information Retrieval, Natural Language Processing, Machine Translation)
Liulin Feng
Mengxia Shen
Qi Li
Shengnan An (Meituan: Model Evaluation, Reasoning, Natural Language Processing, Large Language Models)
Shun Wang
Wei Shi
Xiangyu Xi (Peking University; Meituan Group: Natural Language Processing, Event Extraction, Information Extraction, Task-Oriented Dialogue)
Xiaoyu Li
Xuezhi Cao (Meituan: Data Mining, Knowledge Graphs, LLMs)
Yi Lu
Yunke Zhao
Zhengyu Chen
Zhimin Lin
Wei Wang