Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

This work addresses the limitations of existing process reward models (PRMs), which rely on costly and error-prone human annotations and are largely confined to mathematical domains, thereby lacking the fine-grained feedback required for general reasoning tasks. To overcome these challenges, the study introduces a novel paradigm that integrates automated planning with PRM training. Specifically, logical problems are formalized using the Planning Domain Definition Language (PDDL), and large-scale datasets comprising millions of reasoning steps are generated via automated planning algorithms. This approach yields a cross-domain, scalable framework for PRM training. Experimental results demonstrate significant performance gains across multiple mathematical and non-mathematical reasoning benchmarks, confirming the effectiveness and generalizability of planning-generated data in enhancing model reasoning capabilities.

Technology Category

Application Category

📝 Abstract

Process Reward Models (PRMs) have emerged as a powerful tool for providing step-level feedback when evaluating the reasoning of Large Language Models (LLMs), which frequently produce chains of thought (CoTs) containing errors even when the final answer is correct. However, existing PRM datasets remain expensive to construct, prone to annotation errors, and predominantly limited to the mathematical domain. This work introduces a novel and scalable approach to PRM dataset generation based on planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, we generate a corpus of approximately one million reasoning steps across various PDDL domains and use it to train PRMs. Experimental results show that augmenting widely-used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning, as demonstrated across multiple benchmarks. These findings indicate that planning problems constitute a scalable and effective resource for generating robust, precise, and fine-grained training data for PRMs, going beyond the classical mathematical sources that dominate this field.

Problem

Research questions and friction points this paper is trying to address.

Process Reward Models

step-level rewards

dataset generation

reasoning evaluation

annotation scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Models

Planning Domain Definition Language

Step-level Rewards