Verifiable Process Rewards for Agentic Reasoning

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the credit assignment challenge that sparse outcome-based rewards create in long-horizon agentic reasoning. It proposes the Verifiable Process Rewards (VPR) framework, which builds dense rewards from reliable intermediate verification. VPR uses symbolic or algorithmic oracles, instantiated as search-based, constraint-based, and posterior-based verifiers, to convert intermediate decisions into turn-level reinforcement signals. A theoretical analysis shows that these dense, verifier-grounded rewards can provide more localized learning signals, with the benefit depending on verifier reliability. Empirically, VPR outperforms both outcome-level rewards and rollout-based process reward baselines in controlled environments and transfers to general and agentic reasoning benchmarks, suggesting that verifiable process supervision fosters reasoning skills that apply beyond the training environments.

📝 Abstract

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.
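To make the contrast between outcome-level and turn-level feedback concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy constraint-based setting in which each action picks an integer and a hypothetical oracle, `all_different_oracle`, accepts the action only if that value has not been used before. The function names, the reward values (`step_bonus`, `step_penalty`), and the choice to add the outcome reward to the final turn are all illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]  # values chosen so far (toy state, assumed for illustration)
Action = int

def all_different_oracle(state: State, action: Action) -> bool:
    """Hypothetical constraint-based verifier: valid if the value is unused."""
    return action not in state

def turn_level_rewards(
    actions: Sequence[Action],
    oracle: Callable[[State, Action], bool],
    outcome_reward: float,
    step_bonus: float = 1.0,     # assumed reward for a verified-correct turn
    step_penalty: float = -1.0,  # assumed reward for a verified-incorrect turn
) -> List[float]:
    """Score every turn with the oracle, then add the outcome reward to the last turn."""
    state: State = ()
    rewards: List[float] = []
    for action in actions:
        rewards.append(step_bonus if oracle(state, action) else step_penalty)
        state = state + (action,)
    if rewards:
        rewards[-1] += outcome_reward
    return rewards

def outcome_only_rewards(actions: Sequence[Action], outcome_reward: float) -> List[float]:
    """Sparse baseline: zero everywhere except the final turn."""
    rewards = [0.0] * len(actions)
    if rewards:
        rewards[-1] = outcome_reward
    return rewards

def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Discounted return from each turn onward."""
    out: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

# A failed episode (outcome reward 0) whose third action repeats the value 3.
episode = [3, 1, 3, 2]
dense = turn_level_rewards(episode, all_different_oracle, outcome_reward=0.0)
sparse = outcome_only_rewards(episode, outcome_reward=0.0)
print(dense)   # [1.0, 1.0, -1.0, 1.0]
print(sparse)  # [0.0, 0.0, 0.0, 0.0]
```

In this toy episode the sparse signal is identical for all four turns, while the turn-level signal singles out the third action as the one the oracle rejected. This illustrates the kind of localization the abstract describes, and it holds only to the extent that the oracle itself is reliable.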

Problem

Research questions and friction points this paper is trying to address.

credit assignment

agentic reasoning

verifiable rewards

reinforcement learning

intermediate verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifiable Process Rewards

dense supervision

credit assignment