Online Process Reward Learning for Agentic Reinforcement Learning

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Temporal credit assignment for long-horizon agents remains challenging under sparse or unverifiable rewards. Method: This paper proposes Online Process Reward Learning (OPRL), an implicit process reward model that requires no additional sampling or human annotation; it automatically decomposes trajectory-level preferences into step-level rewards and jointly optimizes them with outcome rewards to guide policy updates. Theoretically, the model satisfies both reward consistency and potential-based reward shaping, enabling a stable self-amplifying training loop. OPRL jointly optimizes the policy and process reward model via a trajectory-based DPO objective and employs low-variance policy gradients leveraging both step-level and episode-level advantage functions. Results: On benchmarks including WebShop, VisualSokoban, and SOTOPIA, OPRL significantly outperforms state-of-the-art LLMs and RL methods, achieving superior sample efficiency, lower gradient variance, enhanced exploration capability, and new SOTA performance.
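The core mechanism described above, decomposing a trajectory-level preference into implicit step-level rewards via a DPO-style objective, can be sketched as follows. This is a minimal illustration, not the paper's code: `BETA` is an assumed hyperparameter, and the per-step reward is the standard implicit form, a scaled log-probability ratio between the learned model and a frozen reference.

```python
import math

BETA = 0.1  # reward scale; assumed hyperparameter, not from the paper


def step_rewards(logps_model, logps_ref, beta=BETA):
    """Implicit per-step rewards: scaled log-prob ratios between the
    jointly trained model and a frozen reference policy."""
    return [beta * (lp - lr) for lp, lr in zip(logps_model, logps_ref)]


def trajectory_dpo_loss(chosen_model, chosen_ref,
                        rejected_model, rejected_ref, beta=BETA):
    """Trajectory-based DPO loss: the summed implicit step rewards of
    the preferred trajectory should exceed those of the rejected one."""
    r_chosen = sum(step_rewards(chosen_model, chosen_ref, beta))
    r_rejected = sum(step_rewards(rejected_model, rejected_ref, beta))
    margin = r_chosen - r_rejected
    # -log sigmoid(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two trajectories are indistinguishable the loss sits at log 2; as the chosen trajectory's implicit reward pulls ahead, the loss falls, which is what drives the step rewards to respect the trajectory preference.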

📝 Abstract
Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high variance from overly fine-grained signals, or failures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent's policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for policy updates, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample efficiency and lower variance during training. Further analysis also demonstrates that OPRL explores efficiently, using fewer actions, underscoring its potential for agentic learning in real-world scenarios.
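The abstract's guarantee that the learned step rewards act as potential-based shaping rewards rests on a classical property: shaping terms of the form γΦ(s') − Φ(s) telescope over an episode, so their discounted sum depends only on the start and end states, never on the path taken. A small sketch verifying this telescoping identity (the potential values here are arbitrary illustrative numbers):

```python
def discounted_shaping_sum(potentials, gamma=0.99):
    """Discounted sum of potential-based shaping terms
    F_t = gamma * Phi(s_{t+1}) - Phi(s_t) over an episode.
    Telescopes to gamma**T * Phi(s_T) - Phi(s_0), so shaping
    cannot alter which trajectories are preferred."""
    total = 0.0
    for t in range(len(potentials) - 1):
        total += (gamma ** t) * (gamma * potentials[t + 1] - potentials[t])
    return total
```

Because the sum is path-independent, adding such rewards changes per-step credit assignment without changing the optimal policy, which is why the paper can use them to densify sparse outcome rewards safely.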
Problem

Research questions and friction points this paper is trying to address.

Sparse and unverifiable rewards challenge temporal credit assignment in agentic RL
Existing process supervision methods suffer from biased annotation and reward hacking
Fine-grained signals cause high variance and fail with rare state overlap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Process Reward Learning integrates implicit reward modeling
Combines trajectory preferences with outcome rewards for policy updates
Uses potential-based shaping rewards to stabilize training gradients
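The second innovation bullet, combining step-level signals with outcome rewards for policy updates, might be sketched as below. This is an assumed simplification, not the paper's estimator: `gamma` and `mix` are hypothetical hyperparameters, the step-level advantage is a baseline-free discounted return-to-go over the implicit process rewards, and the episode-level advantage is the outcome reward broadcast to every step.

```python
def combined_advantages(step_rewards, outcome_reward, gamma=0.99, mix=0.5):
    """Blend step-level advantages (from implicit process rewards) with
    an episode-level advantage (from the sparse outcome reward).
    gamma and mix are assumed hyperparameters, not from the paper."""
    T = len(step_rewards)
    step_adv = [0.0] * T
    running = 0.0
    # Step-level term: discounted return-to-go of the process rewards
    for t in reversed(range(T)):
        running = step_rewards[t] + gamma * running
        step_adv[t] = running
    # Episode-level term: the outcome reward applied to every step
    return [(1.0 - mix) * a + mix * outcome_reward for a in step_adv]
```

With `mix=0` the update relies purely on dense process rewards; with `mix=1` it reduces to outcome-only credit assignment; intermediate values give the blended, lower-variance signal the summary describes.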