A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

📅 2025-10-09
🤖 AI Summary
Large language models (LLMs) have long relied on outcome-based reward modeling, which evaluates only final answers and therefore offers limited interpretability and robustness for reasoning processes. Method: This paper systematically introduces the Process Reward Modeling (PRM) paradigm, shifting supervision from outcome-level to step- or trajectory-level reasoning evaluation, and establishes a comprehensive methodology encompassing data construction, fine-grained reward modeling, test-time scaling, and RLHF integration. Contribution/Results: We empirically examine PRMs across diverse domains, including mathematics, code generation, natural language, multimodal reasoning, and robotic agent tasks. To our knowledge, this is the first work to characterize the design space and core challenges of PRMs across multiple domains, releasing a cross-task benchmark, practical implementation guidelines, and open-source resources. Our framework provides foundational theoretical insights, actionable technical pathways, and empirical support for trustworthy reasoning alignment in LLMs.

📝 Abstract
Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
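The test-time-scaling use of PRMs described above can be sketched in a few lines: sample several candidate reasoning chains, score each step with the PRM, and keep the chain whose weakest step scores highest. The sketch below is illustrative only; `score_step` is a hypothetical stand-in for a trained process reward model, and the min-aggregation rule is one common choice among those surveyed, not the paper's specific method.

```python
def score_step(step: str) -> float:
    # Toy heuristic stand-in: a real PRM is a learned model that
    # outputs a correctness score for each reasoning step.
    return 0.9 if "so" in step else 0.5

def prm_score(trajectory: list[str]) -> float:
    # Aggregate step-level rewards into a trajectory score; min is a
    # common choice because one flawed step can invalidate the chain.
    return min(score_step(step) for step in trajectory)

def best_of_n(candidates: list[list[str]]) -> list[str]:
    # Test-time scaling: among N sampled chains, select the one
    # whose weakest step the PRM rates highest.
    return max(candidates, key=prm_score)
```

Under this toy scorer, a chain with one weak step loses to a chain that is uniformly adequate, which is exactly the fine-grained signal an ORM (scoring only the final answer) cannot provide.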
Problem

Research questions and friction points this paper is trying to address.

Developing process reward models for step-level reasoning evaluation
Addressing limitations of outcome-only reward models in LLM alignment
Providing systematic framework for process supervision across domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Models evaluate step-level reasoning
PRMs guide reasoning via test-time scaling and reinforcement learning
Survey covers process data generation and application benchmarks