AI Summary
This work addresses two weaknesses of existing retrieval-augmented agents: noisy retrieval can derail multi-step reasoning, and conventional reinforcement learning supplies only outcome-level rewards that poorly guide intermediate reasoning steps. To overcome these challenges, the authors model retrieval quality assessment as an explicit action within the agent's decision process. They introduce a Search-to-Evaluate protocol that attaches a structured score to each retrieval step, integrating self-evaluation directly into the reasoning trajectory and thereby constructing a process-aligned reward signal. They further present Process-Calibrated Advantage Rescaling (PCAR) to improve policy-learning efficiency. The approach achieves state-of-the-art average accuracy across seven open-domain question answering benchmarks, with particularly large gains on multi-hop tasks. Ablation studies confirm the effectiveness of both the self-evaluation mechanism and PCAR.
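The summary describes a coupled loop in which every search action is immediately followed by an explicit evaluation action that scores the retrieval. The paper's exact interface is not given here; the following is a minimal sketch under assumed names (`propose_query`, `search`, `evaluate`, `can_answer`, `answer` are all hypothetical placeholders for the agent's actions):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    query: str
    evidence: str
    score: float  # structured evaluation score attached to this retrieval

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

def search_to_evaluate(agent, question, max_steps=4):
    """Sketch of a Search-to-Evaluate loop: each search is immediately
    followed by an evaluate action, so the trajectory carries per-step
    process signals rather than only a final outcome reward."""
    traj = Trajectory()
    for _ in range(max_steps):
        query = agent.propose_query(question, traj)   # reasoning step
        evidence = agent.search(query)                # search action
        score = agent.evaluate(query, evidence)       # evaluate-as-action
        traj.steps.append(Step(query, evidence, score))
        if agent.can_answer(question, traj):
            break
    return agent.answer(question, traj), traj
```

The per-step `score` values collected in `traj` are what a process-aligned reward signal can later be built from.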
Abstract
Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose \textsc{EvalAct} (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that \textsc{EvalAct} achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
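The abstract says PCAR rescales GRPO advantages at the segment level by evaluation score, emphasizing reliable segments and updating uncertain ones conservatively. The exact rescaling rule is not stated here; a minimal sketch, assuming per-segment scores in [0, 1] and a floor weight `alpha` (both hypothetical choices), could look like:

```python
import numpy as np

def pcar_rescale(advantages, segment_ids, eval_scores, alpha=0.5):
    """Sketch of segment-level advantage rescaling in the spirit of PCAR.

    advantages  : per-token advantages from GRPO (group-normalized returns)
    segment_ids : segment index for each token (a segment = one
                  search-and-evaluate span of the trajectory)
    eval_scores : per-segment evaluation scores in [0, 1]
    alpha       : floor on the weight, so low-scoring segments are still
                  updated, just conservatively
    """
    advantages = np.asarray(advantages, dtype=float)
    # Map each score to a weight in [alpha, 1]: high-scoring (reliable)
    # segments keep their full advantage, uncertain ones are damped.
    weights = alpha + (1.0 - alpha) * np.asarray(eval_scores, dtype=float)
    return advantages * weights[np.asarray(segment_ids)]
```

With `alpha=0.5`, a segment scored 1.0 keeps its advantage unchanged while a segment scored 0.0 has it halved, which matches the stated intent of conservative updates on uncertain segments.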