AI Summary
This work addresses two weaknesses of existing retrieval-augmented agents: noisy retrieval can derail multi-step reasoning, and conventional reinforcement learning supplies only outcome-level rewards that poorly guide intermediate reasoning steps. To overcome these challenges, the authors model retrieval quality assessment as an explicit action within the agent's decision process. They introduce a Search-to-Evaluate protocol that attaches a structured score to each retrieval step, integrating self-evaluation directly into the reasoning trajectory and thereby constructing a process-aligned reward signal. They further present Process-Calibrated Advantage Rescaling (PCAR) to improve policy-learning efficiency. The approach achieves state-of-the-art average accuracy across seven open-domain question answering benchmarks, with particularly large gains on multi-hop tasks. Ablation studies confirm the effectiveness of both the self-evaluation mechanism and PCAR.
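The summary describes a coupled loop in which every search action is immediately followed by an explicit evaluation action that scores the retrieval. The paper's exact interface is not given here; the following is a minimal sketch under assumed names (`propose_query`, `search`, `evaluate`, `can_answer`, `answer` are all hypothetical placeholders for the agent's actions):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    query: str
    evidence: str
    score: float  # structured evaluation score attached to this retrieval

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

def search_to_evaluate(agent, question, max_steps=4):
    """Sketch of a Search-to-Evaluate loop: each search is immediately
    followed by an evaluate action, so the trajectory carries per-step
    process signals rather than only a final outcome reward."""
    traj = Trajectory()
    for _ in range(max_steps):
        query = agent.propose_query(question, traj)   # reasoning step
        evidence = agent.search(query)                # search action
        score = agent.evaluate(query, evidence)       # evaluate-as-action
        traj.steps.append(Step(query, evidence, score))
        if agent.can_answer(question, traj):
            break
    return agent.answer(question, traj), traj
```

The per-step `score` values collected in `traj` are what a process-aligned reward signal can later be built from.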
Abstract
Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose \textsc{EvalAct} (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that \textsc{EvalAct} achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
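The abstract says PCAR rescales GRPO advantages at the segment level by evaluation score, emphasizing reliable segments and updating uncertain ones conservatively. The exact rescaling rule is not stated here; a minimal sketch, assuming per-segment scores in [0, 1] and a floor weight `alpha` (both hypothetical choices), could look like:

```python
import numpy as np

def pcar_rescale(advantages, segment_ids, eval_scores, alpha=0.5):
    """Sketch of segment-level advantage rescaling in the spirit of PCAR.

    advantages  : per-token advantages from GRPO (group-normalized returns)
    segment_ids : segment index for each token (a segment = one
                  search-and-evaluate span of the trajectory)
    eval_scores : per-segment evaluation scores in [0, 1]
    alpha       : floor on the weight, so low-scoring segments are still
                  updated, just conservatively
    """
    advantages = np.asarray(advantages, dtype=float)
    # Map each score to a weight in [alpha, 1]: high-scoring (reliable)
    # segments keep their full advantage, uncertain ones are damped.
    weights = alpha + (1.0 - alpha) * np.asarray(eval_scores, dtype=float)
    return advantages * weights[np.asarray(segment_ids)]
```

With `alpha=0.5`, a segment scored 1.0 keeps its advantage unchanged while a segment scored 0.0 has it halved, which matches the stated intent of conservative updates on uncertain segments.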