PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges that large language model-based search agents face in long-horizon reinforcement learning: sparse rewards, isolated credit assignment, and distributional shift. The authors propose PiCA, a step reward mechanism that introduces "pivot steps" and models a search trajectory as a process of cumulative search progress. PiCA defines a potential-based reward shaping (PBRS) signal grounded in the success probability given the historical context. This yields dense, trajectory-aware credit assignment that stays consistent with the model's natural generation distribution and ties intermediate steps to the final task objective. Across seven knowledge-intensive question-answering benchmarks, PiCA outperforms strong baselines, with reported improvements of 15.2% for 3B models and 2.2% for 7B models, and the gains are consistent across the models tested.

📝 Abstract

Large Language Model (LLM)-based search agents trained with reinforcement learning (RL) have significantly improved performance on knowledge-intensive tasks. However, existing methods encounter critical challenges in long-horizon credit assignment: (i) Reward Sparsity, where models receive only outcome feedback without step-level guidance to differentiate action quality; (ii) Isolated Credit, where credit is assigned to steps independently, failing to capture sequential dependencies; and (iii) Distributional Shift, where rewards are estimated on templates that deviate from the model's natural generative distribution. To address these issues, we propose Pivot-Based Credit Assignment (PiCA), a novel step reward mechanism that reformulates the search trajectory as a sequential process of cumulative search progress. Unlike prior isolated step rewards, PiCA uses Potential-Based Reward Shaping (PBRS) to define process rewards as success probabilities conditioned on the historical context. This approach identifies pivot steps, which comprise target golden sub-queries and sub-answers derived from historical trajectories, as information peaks that significantly boost the likelihood of a correct final answer. By anchoring these step rewards to the final task objective, PiCA provides dense, pivot-aware, and trajectory-dependent guidance while maintaining distributional consistency. Extensive experiments show that PiCA outperforms existing strong baselines across seven knowledge-intensive QA benchmarks, achieving 15.2% and 2.2% improvements for 3B and 7B models, respectively. The consistent performance gains across various models show PiCA's robust generalization. The code is available at https://github.com/novdream/PiCA.
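
The abstract does not spell out the reward formula, so the following is only a minimal sketch of how a PBRS-style step reward could be computed. It assumes the standard shaping form F(s, s') = γΦ(s') − Φ(s), and it assumes that the potential Φ is some estimate of the probability of a correct final answer given the search history so far. The function name `pbrs_step_rewards`, its parameters, and the example potentials are hypothetical and are not taken from the PiCA code.

```python
from typing import List


def pbrs_step_rewards(
    potentials: List[float],
    outcome_reward: float,
    gamma: float = 1.0,
) -> List[float]:
    """Return one shaped reward per search step (illustrative sketch only).

    potentials[t] is an ASSUMED estimate of the success probability given the
    history after t steps, so the list has one more entry than there are steps
    (index 0 is the state before any search step).
    """
    if len(potentials) < 2:
        raise ValueError("need at least two potentials (before and after one step)")

    rewards = []
    for t in range(len(potentials) - 1):
        # Standard PBRS shaping term: gamma * Phi(next state) - Phi(current state).
        rewards.append(gamma * potentials[t + 1] - potentials[t])

    # The sparse outcome reward is added to the final step only.
    rewards[-1] += outcome_reward
    return rewards


# Hypothetical trajectory: the second step raises the estimated success
# probability sharply, so it receives most of the shaped credit.
example = pbrs_step_rewards([0.10, 0.15, 0.60, 0.65], outcome_reward=1.0)
```

With γ = 1 the shaping terms telescope, so the total return differs from the outcome reward only by Φ(final) − Φ(initial). This is the usual reason PBRS is chosen for dense rewards: it redistributes credit across steps without changing which policies are optimal. A large jump in potential, as at the second step above, is one way a pivot-like step could show up in such a signal; the paper's own pivot definition relies on golden sub-queries and sub-answers, which this sketch does not model.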

Problem

Research questions and friction points this paper is trying to address.

credit assignment

reward sparsity

sequential dependencies

distributional shift

search agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pivot-Based Credit Assignment

Potential-Based Reward Shaping

Search Agentic Reinforcement Learning