RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the credit-assignment problem in interactive retrieval, where latent reasoning steps cannot be evaluated directly and affect outcomes only through later executable actions. It proposes RICE-PO, a critic-free policy optimization framework that turns the structure of agent-environment interaction into local supervision signals. High-uncertainty executable actions serve as anchors: local counterfactual branches at each anchor are scored with retrieval metrics, and credit is propagated to latent reasoning steps only when the reasoning-to-action influence is strong and future residual effects are stable. On the BRIGHT and BEIR benchmarks, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting.

📝 Abstract

Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.
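The three stages named in the abstract (anchor selection, local counterfactual evaluation, gated credit propagation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how uncertainty, influence, or residual stability are measured, so the `Step` fields, the mean-of-branches baseline, the standard-deviation stability test, and both thresholds below are assumptions chosen only to make the control flow concrete.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Step:
    """One trajectory step. All fields are hypothetical stand-ins."""
    kind: str               # "reason" (latent) or "action" (executable query/summary)
    entropy: float = 0.0    # assumed uncertainty score for executable actions
    influence: float = 0.0  # assumed reasoning-to-action influence in [0, 1]


def select_anchors(steps, top_k=1):
    """Return indices of the top_k executable actions with the highest uncertainty."""
    actions = [i for i, s in enumerate(steps) if s.kind == "action"]
    ranked = sorted(actions, key=lambda i: steps[i].entropy, reverse=True)
    return sorted(ranked[:top_k])


def local_advantages(branch_scores):
    """Critic-free advantage: each branch's retrieval metric minus the branch mean.

    branch_scores holds one retrieval metric (e.g. nDCG@10) per counterfactual
    branch sampled at the same anchor. Using the mean as baseline is an assumption.
    """
    baseline = mean(branch_scores)
    return [score - baseline for score in branch_scores]


def assign_credit(steps, anchor, advantage, residuals,
                  influence_min=0.5, residual_std_max=0.1):
    """Credit the anchor action, then gate propagation to preceding reasoning steps.

    residuals: assumed per-branch measurements of how later steps change the
    metric; a small spread is treated here as "stable future residual effects".
    """
    credit = {anchor: advantage}
    if pstdev(residuals) > residual_std_max:
        return credit  # unstable future effects: credit stays on the action only
    i = anchor - 1
    # Walk back over the reasoning steps that directly precede the anchor.
    while i >= 0 and steps[i].kind == "reason":
        if steps[i].influence >= influence_min:
            credit[i] = advantage * steps[i].influence
        i -= 1
    return credit


trajectory = [
    Step("reason", influence=0.9),
    Step("action", entropy=0.2),
    Step("reason", influence=0.2),
    Step("reason", influence=0.8),
    Step("action", entropy=1.5),
]
anchor = select_anchors(trajectory)[0]
advantages = local_advantages([0.6, 0.4, 0.2])
credit = assign_credit(trajectory, anchor, advantages[0],
                       residuals=[0.01, 0.02, 0.015])
print(anchor, sorted(credit))
```

In this toy trajectory the second executable action is the anchor, and only the reasoning step with high assumed influence immediately before it receives credit; the low-influence step is skipped, and nothing propagates past the earlier action. Widening the spread of `residuals` beyond the threshold leaves credit on the anchor action alone, which mirrors the abstract's condition that credit reaches latent reasoning only when both gates hold.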

Problem

Research questions and friction points this paper is trying to address.

credit assignment

interactive retrieval

reasoning agents

retrieval interaction

latent reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

credit assignment

reasoning agents

retrieval interaction