🤖 AI Summary
Large language models (LLMs) have long relied on outcome-based reward modeling, which evaluates only final answers, yielding limited interpretability and robustness in reasoning. Method: This survey systematically introduces the Process Reward Modeling (PRM) paradigm, which shifts supervision from outcome-level to step- or trajectory-level evaluation of reasoning. It covers the full methodology: data construction, fine-grained reward modeling, test-time scaling, and integration with RLHF. Contribution/Results: The survey reviews PRM applications across diverse domains, including mathematics, code generation, natural language, multimodal reasoning, and robotic agent tasks. To our knowledge, it is the first to characterize the design space and core challenges of PRMs across multiple domains, accompanied by a cross-task benchmark, practical implementation guidelines, and open-source resources. The resulting framework offers theoretical insights, actionable technical pathways, and empirical grounding for trustworthy reasoning alignment in LLMs.
📝 Abstract
Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs), which judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, how to build PRMs, and how to use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
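To make the ORM/PRM contrast concrete, the sketch below shows one common way a PRM is used at test time: score each reasoning step of several candidate chains, aggregate the step scores, and keep the best-ranked chain (best-of-N selection). The `score_step` function here is a hypothetical stand-in for a learned PRM, and min-aggregation is just one of several choices discussed in the literature (product and mean are others); this is a toy illustration, not an implementation from the survey.

```python
# Toy sketch of PRM-guided best-of-N selection at test time.
# A real PRM is a learned model that scores each reasoning step;
# score_step below is a hypothetical stand-in for illustration.

def score_step(step: str) -> float:
    # Hypothetical step scorer: returns a pseudo-probability that
    # the step is correct (a learned PRM would produce this).
    return 0.9 if "error" not in step else 0.4

def chain_score(steps: list[str]) -> float:
    # Min-aggregation: a chain is only as trustworthy as its weakest
    # step. Product or mean aggregation are common alternatives.
    return min(score_step(s) for s in steps)

def best_of_n(chains: list[list[str]]) -> list[str]:
    # Test-time scaling: sample N candidate chains from the LLM,
    # then keep the one the PRM ranks highest.
    return max(chains, key=chain_score)

chains = [
    ["compute 2+2=4", "conclude the answer is 4"],
    ["compute 2+2=5 (error)", "conclude the answer is 5"],
]
print(best_of_n(chains))  # the error-free chain wins
```

An ORM, by contrast, would score only the final answer of each chain, with no signal about where a faulty chain went wrong.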