Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether reinforcement learning (RL) necessarily requires an explicit Process Reward Model (PRM) to enhance large language models' (LLMs) mathematical reasoning capabilities. Method: The authors conduct pure RL training on DeepSeek-R1 and QwQ-32B—without any external PRM—and observe that both models spontaneously develop implicit PRM capabilities, i.e., internalized process-level supervision. Building on this, they propose Self-PRM, a self-reflection framework wherein the model autonomously generates multiple solutions, evaluates its own reasoning steps, and re-ranks outputs—fully end-to-end and without external reward signals. Contribution/Results: Experiments demonstrate that pure RL significantly strengthens implicit process supervision; Self-PRM improves accuracy across most mathematical reasoning tasks, yet remains limited on high-difficulty problems (precision <10%), exposing the need for finer-grained reward modeling and further scaling. Crucially, this work empirically validates the feasibility of RL-induced *endogenous* PRM capability and presents a fully self-supervised, end-to-end process optimization framework.
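The Self-PRM loop described above—sample several solutions, let the model score its own reasoning, then rerank—can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scoring function and solution format here are hypothetical stand-ins for actual LLM calls, and the majority-voting baseline from the abstract is included for comparison.

```python
from collections import Counter

def majority_vote(answers):
    """Baseline from the paper's comparison: pick the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]

def self_prm_rerank(solutions, self_score):
    """Self-PRM idea: the model scores its own candidate solutions
    (self-reward) and the top-ranked one is kept."""
    return max(solutions, key=self_score)

# Toy candidates; a real system would sample these from the RL-trained LLM.
solutions = [
    {"answer": "42", "steps": ["expand", "simplify"]},
    {"answer": "41", "steps": ["expand"]},
    {"answer": "42", "steps": ["expand", "simplify", "verify"]},
]

def toy_self_score(sol):
    # Hypothetical self-reward: here, just favor longer reasoning chains.
    # In Self-PRM this would be the model judging its own steps.
    return len(sol["steps"])

best = self_prm_rerank(solutions, toy_self_score)
vote = majority_vote(s["answer"] for s in solutions)
print(best["answer"], vote)  # both select "42" on this toy input
```

The abstract's failure mode shows up naturally in this framing: if `toy_self_score` assigns high scores to flawed chains (low precision on hard problems), reranking confidently selects a wrong answer even when majority voting would not.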

📝 Abstract
The development of reasoning capabilities represents a critical frontier in large language model (LLM) research, where reinforcement learning (RL) and process reward models (PRMs) have emerged as predominant methodological frameworks. Contrary to conventional wisdom, empirical evidence from DeepSeek-R1 demonstrates that pure RL training focused on mathematical problem-solving can progressively enhance reasoning abilities without PRM integration, challenging the perceived necessity of process supervision. In this study, we conduct a systematic investigation of the relationship between RL training and PRM capabilities. Our findings demonstrate that problem-solving proficiency and process supervision capabilities represent complementary dimensions of reasoning that co-evolve synergistically during pure RL training. In particular, current PRMs underperform simple baselines like majority voting when applied to state-of-the-art models such as DeepSeek-R1 and QwQ-32B. To address this limitation, we propose Self-PRM, an introspective framework in which models autonomously evaluate and rerank their generated solutions through self-reward mechanisms. Although Self-PRM consistently improves benchmark accuracy (particularly at larger sample sizes), analysis exposes persistent challenges: the approach exhibits low precision (<10%) on difficult problems, frequently misclassifying flawed solutions as valid. These analyses underscore the need for continued RL scaling to improve reward alignment and introspective accuracy. Overall, our findings suggest that PRMs may not be essential for enhancing complex reasoning, as pure RL not only improves problem-solving skills but also inherently fosters robust PRM capabilities. We hope these findings provide actionable insights for building more reliable and self-aware complex reasoning models.
Problem

Research questions and friction points this paper is trying to address.

Is an explicit PRM necessary for RL to enhance LLM reasoning?
Why do current PRMs underperform simple baselines (e.g., majority voting) on state-of-the-art models?
Can self-rewarding close this gap, given Self-PRM's low precision on hard problems?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pure RL training enhances reasoning without PRM
Self-PRM autonomously evaluates and reranks solutions
RL scaling improves reward alignment and accuracy
👥 Authors
Zhangying Feng — Huawei Technologies Ltd.
Qianglong Chen — Huawei Technologies Ltd.
Ning Lu — HKUST
Yongqian Li — Unknown affiliation
Siqi Cheng — Huawei Technologies Ltd.
Shuangmu Peng — Huawei Technologies Ltd.
Duyu Tang — Huawei
Shengcai Liu — Southern University of Science and Technology
Zhirui Zhang — Huawei Technologies Ltd.