🤖 AI Summary
This work addresses the challenge of automatically localizing erroneous steps in the mathematical reasoning of large language models (LLMs). We introduce ProcessBench, the first fine-grained error-localization benchmark—comprising 3,400 competition-level problems with human-annotated error steps—and systematically evaluate models' ability to identify the earliest erroneous step in chain-of-thought reasoning. Methodologically, we combine human-annotated data, process reward model (PRM) fine-tuning, and stepwise critique prompting. Experiments reveal that mainstream PRMs significantly underperform critique prompting on high-difficulty problems; the best open-source model, QwQ-32B-Preview, achieves critique capability near that of GPT-4o but still lags behind o1-mini. This study establishes a quantifiable framework for pinpointing reasoning errors in mathematical problem solving, uncovers fundamental generalization bottlenecks of existing PRMs, and demonstrates the effectiveness and scalability of lightweight, prompt-based critique strategies.
📝 Abstract
As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly important for their scalable oversight. In this paper, we introduce ProcessBench, a benchmark for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with its error location annotated by human experts. Models are required to identify the earliest step that contains an error, or to conclude that all steps are correct. We conduct an extensive evaluation on ProcessBench involving two types of models: process reward models (PRMs) and critic models; for the latter, we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, demonstrates critique capability competitive with the proprietary model GPT-4o, though it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
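The task described above (predict the earliest erroneous step, or declare the whole solution correct) can be sketched as a scoring function. The data format, the `-1` convention for fully correct solutions, and the harmonic-mean aggregation below are illustrative assumptions, not the benchmark's official release or metric:

```python
# Minimal sketch of ProcessBench-style scoring (illustrative assumptions,
# not the official evaluation code or data format).
# Convention (assumed): a label is the 0-based index of the earliest
# erroneous step, or -1 if every step in the solution is correct.

def score(samples):
    """samples: list of (gold_label, predicted_label) pairs."""
    err = [(g, p) for g, p in samples if g != -1]  # solutions containing an error
    ok = [(g, p) for g, p in samples if g == -1]   # fully correct solutions
    acc_err = sum(g == p for g, p in err) / len(err) if err else 0.0
    acc_ok = sum(g == p for g, p in ok) / len(ok) if ok else 0.0
    # Harmonic mean of the two subset accuracies, so a model cannot score
    # well by always (or never) flagging an error. One plausible aggregate.
    if acc_err + acc_ok == 0:
        return 0.0
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)

# Toy demo: gold vs. predicted earliest-error indices.
demo = [(2, 2), (0, 1), (-1, -1), (-1, 3)]
print(score(demo))  # 0.5 on each subset -> harmonic mean 0.5
```

Splitting the score by erroneous versus correct solutions matters because a critic that flags an error in every solution would otherwise look deceptively strong on a benchmark dominated by erroneous samples.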