🤖 AI Summary
This work addresses the challenge of evaluating the contribution of individual steps in information-seeking tasks, where trajectory-level rewards are often insufficient and step-level approaches typically rely on computationally expensive tree sampling. The authors model world knowledge as an implicit entity-relation graph and frame the search process as a traversal toward the answer node within this graph. Based on this formulation, they introduce a credit assignment mechanism that does not require tree sampling. The core contributions are the Graph-Distance Contribution Reward (GDCR) and the Step Advantage Policy Optimization (SAPO) algorithm. Experimental results demonstrate that the proposed method significantly outperforms existing approaches across four challenging information-seeking benchmarks, confirming its effectiveness and generalizability.
📝 Abstract
In agentic search, trajectory-level outcome rewards fail to quantify the contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each information-seeking (IS) task as a search within a latent task graph, where effective steps should make progress through the graph toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly retrieved and newly cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.
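The idea of scoring steps by graph distance and then mixing step-level and trajectory-level advantages can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the paper's formulation: the toy graph, the function names, the `1 / (1 + d)` reward shaping, the normalization, and the mixing `weight` are all hypothetical, and the sketch does not distinguish retrieved from cited entities.

```python
from collections import deque
from statistics import mean, pstdev


def distances_to_answer(graph, answer):
    """BFS hop counts from the answer node over an undirected ER graph."""
    dist = {answer: 0}
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def step_rewards(steps, dist):
    """Score each step by the closest entity it introduces for the first time.

    The 1 / (1 + d) shaping is an assumption, not the paper's formula.
    Entities with no path to the answer contribute nothing.
    """
    seen = set()
    rewards = []
    for entities in steps:
        new = [e for e in entities if e not in seen]
        seen.update(new)
        reachable = [dist[e] for e in new if e in dist]
        rewards.append(1.0 / (1 + min(reachable)) if reachable else 0.0)
    return rewards


def normalize(values):
    """Zero-mean, unit-variance scaling; all zeros when there is no spread."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def combined_advantages(outcome_adv, rewards, weight=0.5):
    """Add a weighted, normalized step term to one trajectory-level advantage."""
    return [outcome_adv + weight * a for a in normalize(rewards)]


# Hypothetical ER graph as an adjacency list (edges listed in both directions).
graph = {
    "question_entity": ["bridge"],
    "bridge": ["question_entity", "answer"],
    "answer": ["bridge"],
    "distractor": [],
}
# Entities surfaced at each step of one hypothetical trajectory.
steps = [["question_entity"], ["distractor"], ["bridge", "question_entity"], ["answer"]]

dist = distances_to_answer(graph, "answer")
rewards = step_rewards(steps, dist)
advantages = combined_advantages(outcome_adv=1.0, rewards=rewards)
```

In this toy trajectory the step that reaches the answer node receives the largest combined advantage and the step that only surfaces a disconnected distractor receives the smallest, even though all four steps share the same trajectory-level outcome advantage. A single breadth-first search from the answer node yields the distances for every step, so no additional rollouts are sampled.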