🤖 AI Summary
Existing methods struggle to automatically verify the correctness of complex mathematical proofs, because correctness cannot be determined by simple answer matching. This work proposes a scalable data generation and training framework that leverages large language models to automatically produce diverse "problem–proof–verification" triplets, complemented by hierarchical human review to ensure label consistency. To enhance reinforcement learning stability, the framework incorporates a process-based reward mechanism and token-level weighting. The resulting Proof-RM significantly outperforms baseline models in reward accuracy, generalization, and ability to guide reasoning, offering a new, efficient paradigm for improving mathematical reasoning in large language models with minimal reliance on human annotation.
📝 Abstract
While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "**question-proof-check**" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Using these data, we train a proof-checking RM, incorporating an additional process reward and token-weight balancing to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability, and test-time guidance, providing practical recipes and tools for strengthening LLM mathematical capabilities.
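The "**question-proof-check**" triplets described in the abstract can be sketched as a minimal data record together with the outcome reward an RM would be trained to reproduce. All names and fields here are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ProofTriplet:
    """Hypothetical triplet record: a problem, a candidate proof, and a
    human/model verification label (illustrative, not the paper's schema)."""
    question: str  # the proof-based problem statement
    proof: str     # a model-generated candidate proof
    check: bool    # verification label: True if the proof is judged correct

def outcome_reward(triplet: ProofTriplet) -> float:
    """Toy outcome reward a proof-checking RM might be trained to match:
    1.0 for a correct proof, 0.0 for a flawed one. The paper additionally
    uses a process reward over intermediate steps, not shown here."""
    return 1.0 if triplet.check else 0.0

good = ProofTriplet(
    "Show that the sum of two even integers is even.",
    "Let a = 2m and b = 2n; then a + b = 2(m + n), which is even.",
    True,
)
bad = ProofTriplet(
    "Show that the sum of two even integers is even.",
    "Even numbers are large, so their sum is even.",
    False,
)

print(outcome_reward(good))  # 1.0
print(outcome_reward(bad))   # 0.0
```

This sketches only the outcome-level signal; the abstract's process reward and token-weight balancing would operate over intermediate proof steps and token positions, respectively.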