RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses credit assignment in interactive retrieval, where latent reasoning steps are not directly observable, by proposing RICE-PO, a critic-free policy optimization framework. The method uses the structure of the agent-environment interaction to generate local supervision signals, taking high-uncertainty executable actions as anchors. It evaluates local counterfactual branches at those anchors with retrieval metrics and propagates credit to latent reasoning steps only when the reasoning-to-action influence is strong and future residual effects are stable. On the BRIGHT and BEIR benchmarks, the approach consistently outperforms prompt-based agents and group-based reinforcement learning baselines under the same retriever setting.
📝 Abstract
Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.
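The three steps named in the abstract (anchor selection, local counterfactual evaluation, gated credit propagation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Step` fields, the entropy-based anchor choice, the mean-of-branches baseline, the thresholds `min_influence` and `max_residual_std`, and the influence-weighted propagation rule are all assumptions made here for illustration, since the abstract does not specify how any of them is computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One trajectory step. All fields are hypothetical, for illustration only."""
    kind: str                                    # "reasoning" (latent) or "action" (executable)
    entropy: float = 0.0                         # assumed policy uncertainty at an action
    branch_scores: Optional[List[float]] = None  # retrieval metric per local branch; index 0 = action taken
    influence: float = 0.0                       # assumed reasoning-to-action influence in [0, 1]
    residual_std: float = 0.0                    # assumed spread of future residual reward


def select_anchors(steps: List[Step], top_k: int) -> List[int]:
    """Return indices of the top_k highest-entropy executable actions."""
    actions = [i for i, s in enumerate(steps)
               if s.kind == "action" and s.branch_scores]
    ranked = sorted(actions, key=lambda i: steps[i].entropy, reverse=True)
    return sorted(ranked[:top_k])


def local_advantage(branch_scores: List[float]) -> float:
    """Critic-free estimate: taken branch's score minus the mean over local branches."""
    return branch_scores[0] - sum(branch_scores) / len(branch_scores)


def assign_credit(steps: List[Step], top_k: int = 1,
                  min_influence: float = 0.5,
                  max_residual_std: float = 0.1) -> List[float]:
    """Give each anchor its local advantage, then pass credit back to the
    reasoning steps directly before it only if both gates are satisfied."""
    credit = [0.0] * len(steps)
    for a in select_anchors(steps, top_k):
        adv = local_advantage(steps[a].branch_scores)
        credit[a] = adv
        j = a - 1
        while j >= 0 and steps[j].kind == "reasoning":
            r = steps[j]
            # Gate: strong influence AND stable future residual (assumed thresholds).
            if r.influence >= min_influence and r.residual_std <= max_residual_std:
                credit[j] = r.influence * adv
            j -= 1
    return credit


# Toy trajectory: reasoning -> query -> reasoning -> query.
steps = [
    Step("reasoning", influence=0.8, residual_std=0.05),
    Step("action", entropy=1.2, branch_scores=[0.6, 0.2, 0.4]),
    Step("reasoning", influence=0.2, residual_std=0.05),
    Step("action", entropy=0.3, branch_scores=[0.5, 0.5]),
]
print(assign_credit(steps, top_k=1))
```

In this toy trajectory only the first query is uncertain enough to become an anchor, so only it and the reasoning step before it receive non-zero credit. The second reasoning step and the second query get none, in contrast to an outcome-level reward that would credit every step equally.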
Problem

Research questions and friction points this paper is trying to address.

credit assignment
interactive retrieval
reasoning agents
retrieval interaction
latent reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

credit assignment
reasoning agents
retrieval interaction
policy optimization
counterfactual evaluation
🔎 Similar Papers
2024-05-03 · Annual International ACM SIGIR Conference on Research and Development in Information Retrieval · Citations: 2