IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

📅 2026-04-16

🤖 AI Summary

This work addresses two limitations of existing reinforcement learning methods for search-augmented reasoning: their sparse trajectory-level rewards cannot distinguish informative queries from redundant ones, and the gradient signal collapses to near zero when every sampled trajectory fails. The authors propose IG-Search, a framework that derives step-level information gain (IG) rewards directly from the policy's own output probabilities, without intermediate annotations. For each retrieval step, the reward measures how much the retrieved documents raise the model's confidence in the gold answer relative to a counterfactual baseline of random documents. It is applied to the search-query tokens through per-token advantage modulation in GRPO, which enables fine-grained credit assignment. IG-Search requires neither external intermediate supervision nor shared environment states. With Qwen2.5-3B it reaches an average exact match (EM) score of 0.430 across seven question-answering benchmarks, 1.6 points above the strongest trajectory-level baseline (MR-Search) and 0.9 points above the step-level method GiGPO, with the largest gains on multi-hop tasks. It adds only about 6.4% to per-step training wall-clock time and leaves inference latency unchanged.
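As a rough illustration of the step-level reward, the sketch below scores one search step as the difference in gold-answer log-likelihood with retrieved documents versus random documents, following the abstract's description. The function names and the use of summed token log-probabilities are assumptions made for illustration; the paper's exact formulation may differ (for example, in normalization or clipping).

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Sum of log-probabilities the policy assigns to the gold-answer tokens."""
    return sum(math.log(p) for p in token_probs)


def information_gain(
    probs_with_retrieved: Sequence[float],
    probs_with_random: Sequence[float],
) -> float:
    """Hypothetical IG for one search step.

    Both arguments hold the policy's per-token probabilities of the gold
    answer: once conditioned on the documents the query actually retrieved,
    once conditioned on random documents (the counterfactual baseline).
    A positive value means the query surfaced useful evidence.
    """
    return sequence_log_prob(probs_with_retrieved) - sequence_log_prob(probs_with_random)


# A query whose documents double the probability of the first answer token.
useful = information_gain([0.5, 0.5], [0.25, 0.5])

# A redundant query: retrieved documents help no more than random ones.
redundant = information_gain([0.25, 0.5], [0.25, 0.5])
```

Under these assumptions a redundant query scores zero and a precise one scores positive, which is the distinction a trajectory-level reward cannot make within a rollout.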

📝 Abstract

Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model's confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
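The credit-assignment idea can be sketched as follows, assuming an additive form of per-token modulation: each token inside a search query receives the trajectory's group-normalized GRPO advantage plus a scaled IG term for its search step. The names `group_advantages`, `modulate_token_advantages`, `step_ids`, and the weight `beta` are hypothetical, and the additive form is an assumption; the abstract does not specify the exact modulation rule.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each rollout's reward normalized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def modulate_token_advantages(
    traj_adv: float,
    step_ids: Sequence[int],
    step_ig: Sequence[float],
    beta: float = 0.5,
) -> List[float]:
    """Assumed additive modulation of per-token advantages.

    step_ids[t] is the index of the search step whose query contains token t,
    or -1 if token t is not part of any search query. Only search-query
    tokens receive the IG term for their step.
    """
    return [
        traj_adv + (beta * step_ig[s] if s >= 0 else 0.0)
        for s in step_ids
    ]


# Every rollout in the group answers incorrectly, so all outcome rewards are 0
# and the trajectory-level advantages vanish.
advs = group_advantages([0.0, 0.0, 0.0])

# Token layout for one rollout: reasoning, two tokens of query 0,
# reasoning, one token of query 1.
token_advs = modulate_token_advantages(
    traj_adv=advs[0],
    step_ids=[-1, 0, 0, -1, 1],
    step_ig=[0.8, -0.2],
)
```

In this sketch the all-fail group still yields nonzero advantages on the query tokens (positive for the informative query, negative for the unhelpful one), which matches the abstract's claim that a gradient signal survives when every sampled trajectory answers incorrectly. A purely multiplicative modulation would not have this property, which is why the additive form is assumed here.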

Problem

Research questions and friction points this paper is trying to address.

search-augmented reasoning

trajectory-level rewards

step-level credit assignment

information gain

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Gain

Step-level Reward

Search-Augmented Reasoning