🤖 AI Summary
Existing training methods for LLM-based search agents (e.g., GRPO) rely on sparse outcome rewards, neglecting the entity-level information surfaced during reasoning and thus failing to distinguish "near-misses" from complete failures, which discards fine-grained learning signals. To address this, the authors propose E-GRPO, an extension of GRPO that introduces the entity match rate as a dense intermediate reward signal. This signal is derived from the ground-truth entities already present in the synthetically generated training data, enabling fine-grained supervision over the reasoning process. The authors' empirical analysis shows a strong positive correlation between the number of ground-truth entities identified in chain-of-thought reasoning and final answer accuracy. On diverse question-answering and deep research benchmarks, E-GRPO consistently and significantly outperforms GRPO: it improves answer accuracy, induces more efficient reasoning policies, reduces the number of tool calls, and enhances sample efficiency.
📝 Abstract
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
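The core idea, assigning incorrect samples a partial reward proportional to their entity match rate, can be sketched as below. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the substring-based entity matching, and the `alpha` scaling factor are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of an entity-aware reward in the spirit of E-GRPO.
# The matching rule (case-insensitive substring search) and the alpha
# scaling of partial rewards are illustrative assumptions.

def entity_match_rate(trace: str, gold_entities: list[str]) -> float:
    """Fraction of ground-truth entities mentioned in the reasoning trace."""
    if not gold_entities:
        return 0.0
    trace_lower = trace.lower()
    hits = sum(1 for e in gold_entities if e.lower() in trace_lower)
    return hits / len(gold_entities)

def entity_aware_reward(correct: bool, trace: str,
                        gold_entities: list[str],
                        alpha: float = 0.5) -> float:
    """Correct answers receive the full outcome reward; incorrect
    'near-misses' receive a partial reward scaled by entity match rate,
    so a rollout that found most gold entities but flubbed the final
    answer is no longer indistinguishable from a complete failure."""
    if correct:
        return 1.0
    return alpha * entity_match_rate(trace, gold_entities)
```

Under a sparse outcome reward, both an empty trace and a trace that correctly identified every gold entity would score 0.0 when the final answer is wrong; the dense variant separates them, which is what lets group-relative advantages inside GRPO exploit "near-miss" rollouts.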