rePIRL: Learn PRM with Inverse RL for LLM Reasoning

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing process reward model (PRM) approaches either rely on strong assumptions about expert policies or are prone to entropy collapse, limiting their generalization. This work proposes rePIRL, a novel framework that introduces inverse reinforcement learning into PRM learning for large language model (LLM) reasoning. By alternately optimizing the policy and the PRM, rePIRL establishes a dual-learning mechanism that unifies online and offline training under weak expert assumptions and effectively mitigates entropy collapse. Experiments demonstrate that rePIRL significantly outperforms existing methods on mathematical and code reasoning tasks. The learned PRMs prove effective for test-time training, test-time scaling, and generating early signals on challenging problems. Ablation studies further confirm the contribution of each component in the proposed design.

📝 Abstract
Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRMs) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer from intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM alternately. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training on hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.
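The abstract's dual learning process (alternating policy and PRM updates, in the spirit of inverse RL) can be illustrated with a deliberately tiny sketch. This is not the paper's algorithm: it replaces LLM reasoning steps with a one-step bandit, the PRM with a per-action reward table, and expert demonstrations with a single fixed expert action (`EXPERT_ACTION` is a made-up stand-in). It only shows the alternation pattern: the policy ascends the current PRM, then the PRM is nudged to rank expert steps above policy samples.

```python
import math
import random

random.seed(0)
ACTIONS = 4
EXPERT_ACTION = 2  # hypothetical expert demonstrations always take this step

prm = [0.0] * ACTIONS           # toy "PRM": one reward estimate per step/action
policy_logits = [0.0] * ACTIONS  # toy policy: softmax over the same actions

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

for _ in range(300):
    # Policy step: REINFORCE-style update toward higher PRM reward,
    # with the expected reward under the current policy as a baseline.
    probs = softmax(policy_logits)
    a = sample(probs)
    baseline = sum(p * r for p, r in zip(probs, prm))
    advantage = prm[a] - baseline
    for i in range(ACTIONS):
        grad = (1.0 if i == a else 0.0) - probs[i]
        policy_logits[i] += 0.5 * advantage * grad

    # PRM step (inverse-RL flavor): raise reward on expert steps,
    # lower it on the policy's own samples, so the PRM separates the two.
    prm[EXPERT_ACTION] += 0.1
    prm[a] -= 0.1

probs = softmax(policy_logits)
best = max(range(ACTIONS), key=lambda i: probs[i])
```

Under this toy dynamic the policy's most likely action converges to the expert action, since the PRM keeps accumulating positive reward on expert steps relative to whatever the policy currently samples; once the policy imitates the expert, the PRM updates cancel and the loop stabilizes, which is the intuition behind the alternation.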
Problem

Research questions and friction points this paper is trying to address.

process reward model
large language models
inverse reinforcement learning
expert policy assumptions
reward hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Model
Inverse Reinforcement Learning
Large Language Models
Dual Learning
Test-time Training