IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two limitations of existing reinforcement learning methods for search-augmented reasoning: their sparse trajectory-level rewards cannot distinguish informative queries from vague or redundant ones, and the gradient signal collapses to near zero when every sampled trajectory fails. The authors propose IG-Search, a framework that derives step-level Information Gain (IG) rewards directly from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. For each search step, the reward measures how much the retrieved documents raise the model's confidence in the gold answer relative to a counterfactual baseline of random documents, and it is fed back to the search-query tokens through per-token advantage modulation in GRPO for fine-grained credit assignment. Unlike prior step-level methods, IG-Search needs neither externally annotated intermediate supervision nor shared environment states across trajectories. With Qwen2.5-3B it reaches an average exact match (EM) of 0.430 across seven question-answering benchmarks, 1.6 points above the strongest trajectory-level baseline (MR-Search) and 0.9 points above the step-level method GiGPO, with particularly pronounced gains on multi-hop tasks. It adds only about 6.4% to per-step training wall-clock time and leaves inference latency unchanged.

📝 Abstract
Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model's confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
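The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the log-probability difference used for IG, the additive form of the modulation, the weight `alpha`, and all function names are assumptions made for illustration. Only the group-normalized advantage follows the standard GRPO definition.

```python
import math
from statistics import mean, pstdev


def information_gain(logp_gold_retrieved: float, logp_gold_random: float) -> float:
    """Assumed IG form: gain in gold-answer log-probability when the model is
    conditioned on the retrieved documents instead of random documents."""
    return logp_gold_retrieved - logp_gold_random


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO group normalization of trajectory-level rewards."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def modulate_token_advantages(traj_adv, n_tokens, query_spans, step_igs, alpha=0.5):
    """Hypothetical additive modulation: every token starts from the trajectory
    advantage, and the tokens of each search query receive alpha * IG of that step.
    query_spans holds (start, end) token indices, end exclusive."""
    adv = [traj_adv] * n_tokens
    for (start, end), ig in zip(query_spans, step_igs):
        for t in range(start, end):
            adv[t] += alpha * ig
    return adv


# A rollout group in which every trajectory answers incorrectly:
# the trajectory-level advantages are all zero, so plain GRPO has no signal.
group_adv = grpo_advantages([0.0, 0.0, 0.0, 0.0])

# One search step whose documents raise p(gold answer) from 0.1 to 0.4.
ig = information_gain(math.log(0.4), math.log(0.1))

# The query tokens (indices 2..4) still receive a non-zero advantage.
token_adv = modulate_token_advantages(group_adv[0], n_tokens=6,
                                      query_spans=[(2, 5)], step_igs=[ig])
print(group_adv)
print(token_adv)
```

An additive term is used here because it keeps a non-zero signal on the query tokens when all trajectory-level advantages vanish, which matches the behavior the abstract reports. The paper's actual modulation rule may differ.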
Problem

Research questions and friction points this paper is trying to address.

search-augmented reasoning
trajectory-level rewards
step-level credit assignment
information gain
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Gain
Step-level Reward
Search-Augmented Reasoning
Reinforcement Learning
Credit Assignment
Zihan Liang
Kuaishou Technology
Yufei Ma
Peking University
Neural Network Accelerator, Computing-in-Memory, FPGA Design, Neuromorphic Computing
Ben Chen
KuaiShou, Alibaba, HUST, WHU
Multimodal, LLM, Generative Recommendation, Semantic Matching
Zhipeng Qian
Kuaishou Technology
Huangyu Dai
Kuaishou Technology
Lingtao Mao
Kuaishou Technology
Xuxin Zhang
Kuaishou Technology
Chenyi Lei
Kuaishou Technology
Recommender System, Information Retrieval, Generative Recommendation, Multimodal
Wenwu Ou
Kuaishou Technology