AI Summary
This work addresses the credit assignment challenge that sparse, outcome-level rewards create in long-horizon agentic reasoning. It proposes Verifiable Process Rewards (VPR), a framework that builds dense rewards on reliable intermediate verification and is presented as the first systematic approach of its kind. VPR uses symbolic or algorithmic oracles, instantiated as search-based, constraint-based, and posterior-based verifiers, to convert intermediate decisions into turn-level reinforcement signals. A theoretical analysis indicates that such rewards can improve credit assignment by providing more localized learning signals, with the benefit depending on verifier reliability. Empirically, VPR outperforms both outcome-level rewards and rollout-based process reward baselines in controlled environments, and it transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster reasoning skills that apply beyond the training environments.
Abstract
Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.
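To make the contrast between outcome-level and turn-level feedback concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy constraint task (fill a row with the digits 1 to 4 without repeats) and a hypothetical `row_oracle` that checks each action against the constraint. The names `row_oracle`, `rollout_states`, `outcome_rewards`, and `turn_level_rewards`, as well as the +1/-1 reward values, are illustrative assumptions and do not come from the source.

```python
from typing import Callable, List, Sequence, Tuple

# A state is the tuple of digits placed so far; an action is the next digit.
State = Tuple[int, ...]
Oracle = Callable[[State, int], bool]


def row_oracle(state: State, action: int) -> bool:
    """Hypothetical constraint-based verifier: the digit must be in 1..4 and unused."""
    return 1 <= action <= 4 and action not in state


def rollout_states(actions: Sequence[int]) -> List[State]:
    """Return the state observed before each action in the trajectory."""
    states: List[State] = []
    current: State = ()
    for action in actions:
        states.append(current)
        current = current + (action,)
    return states


def outcome_rewards(actions: Sequence[int]) -> List[float]:
    """Sparse baseline: only the final turn is rewarded, and only on full success."""
    success = sorted(actions) == [1, 2, 3, 4]
    rewards = [0.0] * len(actions)
    if rewards:
        rewards[-1] = 1.0 if success else 0.0
    return rewards


def turn_level_rewards(actions: Sequence[int], oracle: Oracle) -> List[float]:
    """Dense variant: every turn is scored by the oracle (+1 / -1 is an assumption)."""
    return [
        1.0 if oracle(state, action) else -1.0
        for state, action in zip(rollout_states(actions), actions)
    ]


# A failed trajectory whose only flawed decision is the third turn.
actions = [1, 2, 2, 4]
print(outcome_rewards(actions))                 # [0.0, 0.0, 0.0, 0.0]
print(turn_level_rewards(actions, row_oracle))  # [1.0, 1.0, -1.0, 1.0]
```

In this toy trajectory the outcome-level signal is zero everywhere, so it cannot distinguish the three valid placements from the one invalid placement, while the oracle-based signal localizes the error to the third turn. How much this helps in practice depends on how reliable the oracle is, as the abstract notes.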