🤖 AI Summary
Large language models (LLMs) have long relied on outcome-based reward modeling, which evaluates only final answers, yielding limited interpretability and robustness in reasoning. Method: This survey systematically introduces the Process Reward Modeling (PRM) paradigm, which shifts supervision from outcome-level to step- or trajectory-level evaluation of reasoning. It covers the full methodology: data construction, fine-grained reward modeling, test-time scaling, and integration with RLHF. Contribution/Results: The survey reviews PRM applications across diverse domains, including mathematics, code generation, natural language, multimodal reasoning, and robotic agent tasks. To our knowledge, it is the first to characterize the design space and core challenges of PRMs across multiple domains, accompanied by a cross-task benchmark, practical implementation guidelines, and open-source resources. The resulting framework offers theoretical insights, actionable technical pathways, and empirical grounding for trustworthy reasoning alignment in LLMs.
📝 Abstract
Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs), which judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, how to build PRMs, and how to use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
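To make the ORM/PRM contrast concrete, the sketch below shows one common way a PRM is used at test time: score each reasoning step of several candidate chains, aggregate the step scores, and keep the best-ranked chain (best-of-N selection). The `score_step` function here is a hypothetical stand-in for a learned PRM, and min-aggregation is just one of several choices discussed in the literature (product and mean are others); this is a toy illustration, not an implementation from the survey.

```python
# Toy sketch of PRM-guided best-of-N selection at test time.
# A real PRM is a learned model that scores each reasoning step;
# score_step below is a hypothetical stand-in for illustration.

def score_step(step: str) -> float:
    # Hypothetical step scorer: returns a pseudo-probability that
    # the step is correct (a learned PRM would produce this).
    return 0.9 if "error" not in step else 0.4

def chain_score(steps: list[str]) -> float:
    # Min-aggregation: a chain is only as trustworthy as its weakest
    # step. Product or mean aggregation are common alternatives.
    return min(score_step(s) for s in steps)

def best_of_n(chains: list[list[str]]) -> list[str]:
    # Test-time scaling: sample N candidate chains from the LLM,
    # then keep the one the PRM ranks highest.
    return max(chains, key=chain_score)

chains = [
    ["compute 2+2=4", "conclude the answer is 4"],
    ["compute 2+2=5 (error)", "conclude the answer is 5"],
]
print(best_of_n(chains))  # the error-free chain wins
```

An ORM, by contrast, would score only the final answer of each chain, with no signal about where a faulty chain went wrong.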