Repurposing Synthetic Data for Fine-grained Search Agent Supervision

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based search agent training methods (e.g., GRPO) rely on sparse outcome rewards that neglect entity-level information surfaced during reasoning, and thus cannot distinguish "near-misses" from complete failures, discarding fine-grained learning signals. To address this, we propose E-GRPO, an extension of GRPO that uses the entity match rate as a dense reward signal: incorrect samples receive partial rewards proportional to the fraction of ground-truth entities, available from the synthetic data generation process, that appear in their reasoning. Our analysis empirically validates a strong positive correlation between the number of correctly identified entities in chain-of-thought reasoning and final answer accuracy. On diverse question-answering and deep research benchmarks, E-GRPO consistently outperforms GRPO: it improves answer accuracy, induces more efficient reasoning policies that require fewer tool calls, and is more sample-efficient.

📝 Abstract
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
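The entity-aware reward described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function names, the substring-based entity matching, and the scaling factor `alpha` are all assumptions.

```python
def entity_match_rate(reasoning_trace: str, ground_truth_entities: list[str]) -> float:
    """Fraction of ground-truth entities (from synthetic data generation)
    that appear in the agent's reasoning trace. Simple case-insensitive
    substring matching is an assumption; the paper may match differently."""
    if not ground_truth_entities:
        return 0.0
    trace = reasoning_trace.lower()
    matched = sum(1 for entity in ground_truth_entities if entity.lower() in trace)
    return matched / len(ground_truth_entities)


def entity_aware_reward(is_correct: bool, match_rate: float, alpha: float = 0.5) -> float:
    """Correct answers keep the full outcome reward; incorrect samples
    receive partial credit proportional to their entity match rate,
    so 'near-misses' are distinguished from complete failures."""
    return 1.0 if is_correct else alpha * match_rate
```

Under this sketch, a rollout that identified two of three ground-truth entities but gave a wrong final answer earns a nonzero reward, whereas a sparse outcome reward would score it the same as a complete failure.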
Problem

Research questions and friction points this paper is trying to address.

Existing training methods discard entity information in synthetic data
Current approaches fail to distinguish near-miss samples from complete failures
Sparse reward signals prevent learning from partially correct reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entity-aware reward function for partial credit
E-GRPO framework leveraging discarded entity information
Dense rewards based on entity match rate
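The dense rewards above plug directly into GRPO's group-relative advantage computation. The sketch below shows the standard GRPO normalization (group mean and standard deviation) applied to entity-aware rewards; it is a minimal illustration, not the paper's implementation, and the `eps` stabilizer is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: each rollout's reward is normalized against
    its group's mean and standard deviation. With dense entity-aware
    rewards, a near-miss ranks above a complete failure instead of
    being collapsed into the same zero-reward bucket."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a group with rewards `[1.0, 0.35, 0.0]` (correct, near-miss, failure) yields a strictly decreasing advantage ordering, whereas sparse outcome rewards `[1.0, 0.0, 0.0]` would give the near-miss and the failure identical advantages.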