Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

📅 2026-03-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of existing retrieval-augmented agents to noisy retrieval during multi-step reasoning, and the limitation of conventional reinforcement learning, whose outcome-level rewards guide intermediate reasoning steps only coarsely. To overcome these challenges, the authors model retrieval quality assessment explicitly as an action within the agent's decision process. They introduce a Search-to-Evaluate protocol that produces a structured score for each retrieval step, integrating self-evaluation directly into the reasoning trajectory and thereby constructing a process-aligned reward signal. They further present Process-Calibrated Advantage Rescaling (PCAR) to improve policy-learning efficiency. The approach achieves state-of-the-art average accuracy across seven open-domain question answering benchmarks, with the largest gains on multi-hop tasks. Ablation studies confirm the effectiveness of both the self-evaluation mechanism and PCAR.

πŸ“ Abstract
Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose \textsc{EvalAct} (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that \textsc{EvalAct} achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
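The abstract describes PCAR as a GRPO-based method that rescales advantages at the segment level according to self-evaluation scores, weighting reliable segments more strongly and updating uncertain ones conservatively. A minimal sketch of that idea is below; the function names, the linear scaling rule, and the `low`/`high` bounds are illustrative assumptions, not the paper's exact formulation.

```python
from typing import List

def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: each rollout's outcome reward normalized
    by the mean and std of its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for identical rewards
    return [(r - mean) / std for r in rewards]

def rescale_by_eval(advantage: float, eval_scores: List[float],
                    low: float = 0.5, high: float = 1.0) -> List[float]:
    """Assumed segment-level rescaling: each retrieval segment receives a
    copy of the trajectory advantage scaled by its evaluation score in
    [0, 1], so high-confidence segments keep near-full weight and
    uncertain segments are updated conservatively."""
    return [advantage * (low + (high - low) * s) for s in eval_scores]

# One group of four rollouts with binary outcome rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_advantages(rewards)
# Per-segment self-evaluation scores for the first rollout's retrievals.
segment_advs = rescale_by_eval(advs[0], [0.9, 0.4, 1.0])
```

The design point is that the outcome-level advantage is not replaced, only modulated: every segment still shares the trajectory's sign, but the evaluation score controls how aggressively each segment is reinforced.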
Problem

Research questions and friction points this paper is trying to address.

retrieval-augmented agents
multi-hop reasoning
noisy retrieval
coarse credit assignment
process rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluate-as-Action
Process Rewards
Retrieval-Augmented Agents
Multi-hop Reasoning
Advantage Rescaling
Jiangming Shu
School of Computer Science and Technology, Beijing Jiaotong University
Yuxiang Zhang
Beijing Jiaotong University
Ye Ma
Hithink RoyalFlush, University of Liverpool, XJTLU
Xueyuan Lin
PhD Student, HKUST(GZ) & IDEA
natural language processing, reinforcement learning, graph neural network
Jitao Sang
School of Computer Science and Technology, Beijing Jiaotong University