🤖 AI Summary
Existing process reward models (PRMs) depend on costly human-annotated step-level error labels and are limited to mathematical reasoning. Method: FoVer uses formal verification tools, namely Z3 (an SMT solver) for formal logic and Isabelle (an interactive theorem prover) for theorem proving, to annotate step-level error labels on LLM responses automatically, so the training data is synthesized without human annotation. This synthesis is feasible only for tasks compatible with formal verification, but PRMs trained on the resulting data generalize to other reasoning tasks. Results: Measured by step-level verification on ProcessBench and by Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH, FoVer-trained PRMs significantly outperform baseline PRMs based on the original LLMs and are competitive with or better than state-of-the-art PRMs trained on labels annotated by humans or stronger models.
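To make the Best-of-K evaluation concrete, here is a minimal sketch of how a PRM's step scores could be used to pick one response out of K candidates. The function names, the example scores, and the choice of `min` as the aggregation rule are assumptions made for illustration; the summary does not specify how step scores are aggregated.

```python
from typing import List, Sequence


def response_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores into one response score.

    Assumption: a response is only as good as its weakest step, so we
    take the minimum. Other rules (product, mean, last step) are possible.
    """
    return min(step_scores) if step_scores else 0.0


def best_of_k(candidates: List[str], step_scores: List[Sequence[float]]) -> str:
    """Return the candidate whose aggregated step score is highest."""
    if len(candidates) != len(step_scores):
        raise ValueError("need one list of step scores per candidate")
    best_index = max(
        range(len(candidates)),
        key=lambda i: response_score(step_scores[i]),
    )
    return candidates[best_index]


# Hypothetical PRM outputs for K = 3 candidate responses.
candidates = ["answer A", "answer B", "answer C"]
scores = [
    [0.9, 0.2, 0.8],  # one weak step drags this response down
    [0.7, 0.6, 0.7],  # consistently plausible steps
    [0.95, 0.5],
]
chosen = best_of_k(candidates, scores)
```

With these made-up scores the second candidate is selected, because its weakest step still scores higher than the weakest step of the other two.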
📝 Abstract
Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. This paper addresses both gaps: automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. We propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proving, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proving tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.
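The idea of labeling each reasoning step with a formal checker can be illustrated on a toy propositional problem. The sketch below is not FoVer's pipeline and does not call Z3 or Isabelle: it swaps in a brute-force truth-table entailment check so that it runs with the standard library alone. The labeling rule (a step is correct if it follows from the premises plus earlier correct steps) and all names are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict, List

Assignment = Dict[str, bool]
Formula = Callable[[Assignment], bool]


def entails(premises: List[Formula], conclusion: Formula,
            variables: List[str]) -> bool:
    """True if every assignment satisfying all premises satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold, conclusion fails
    return True


def label_steps(premises: List[Formula], steps: List[Formula],
                variables: List[str]) -> List[bool]:
    """Label each step True (correct) or False (error).

    Assumed rule: a step is correct if it is entailed by the premises
    together with the earlier steps already labeled correct.
    """
    known = list(premises)
    labels = []
    for step in steps:
        ok = entails(known, step, variables)
        labels.append(ok)
        if ok:
            known.append(step)  # only verified steps may support later ones
    return labels


# Toy problem: the premises are "A" and "A implies B".
variables = ["A", "B", "C"]
premises = [
    lambda e: e["A"],
    lambda e: (not e["A"]) or e["B"],
]
# Hypothetical LLM reasoning steps, already translated into formulas.
steps = [
    lambda e: e["B"],             # follows by modus ponens
    lambda e: e["C"],             # unsupported claim
    lambda e: e["A"] and e["B"],  # follows from the premises and step 1
]
labels = label_steps(premises, steps, variables)
```

Here the second step is labeled as an error because an assignment with `C` false satisfies the premises, while the first and third steps are labeled correct. A truth table scales exponentially in the number of variables, which is one reason a real pipeline would rely on a solver instead.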