🤖 AI Summary
Existing search agents are mostly trained with outcome-based supervision, which leaves the quality of intermediate reasoning and search steps unevaluated. The authors propose SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Its ratings correlate strongly with final answer correctness and are more reliable than those of much larger models such as DeepSeek-V3.1. SRR-Judge is integrated into a modified ReAct-style rate-and-refine workflow, and the resulting annotated trajectories are used for iterative rejection sampling fine-tuning of the base agent. Across multiple challenging deep search benchmarks, the aligned agent improves average pass@1 by more than 10 percentage points (absolute).
📝 Abstract
Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.
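To make the data-selection idea concrete, the following is a minimal sketch of one round of rejection sampling driven by step-level scores. It is not the authors' implementation: the `Step` and `Trajectory` types, the judge signature, the `min_step_score` threshold, and the rule of keeping only trajectories whose answer is correct and whose weakest step clears the threshold are all assumptions made for illustration. The `toy_judge` function is a stand-in heuristic, not SRR-Judge, and the refine half of the rate-and-refine workflow is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    thought: str
    action: str  # e.g. a search query or a final answer (hypothetical format)


@dataclass
class Trajectory:
    question: str
    steps: List[Step]
    answer_correct: bool


# Hypothetical judge interface: (question, preceding steps, current step) -> score in [0, 1].
Judge = Callable[[str, List[Step], Step], float]


def select_for_finetuning(
    trajectories: List[Trajectory],
    judge: Judge,
    min_step_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[Trajectory]:
    """Keep trajectories that are correct and whose lowest-rated step passes the threshold."""
    kept = []
    for traj in trajectories:
        if not traj.answer_correct:
            continue  # assumed outcome filter
        scores = [
            judge(traj.question, traj.steps[:i], step)
            for i, step in enumerate(traj.steps)
        ]
        if scores and min(scores) >= min_step_score:
            kept.append(traj)
    return kept


def toy_judge(question: str, history: List[Step], step: Step) -> float:
    """Placeholder scorer: penalizes empty actions and exact repeats of earlier actions."""
    if not step.action.strip():
        return 0.0
    if any(prev.action == step.action for prev in history):
        return 0.2
    return 1.0


good = Trajectory("q", [Step("plan", "search: a"), Step("refine", "search: b")], True)
repetitive = Trajectory("q", [Step("plan", "search: a"), Step("again", "search: a")], True)
wrong = Trajectory("q", [Step("plan", "search: a")], False)

kept = select_for_finetuning([good, repetitive, wrong], toy_judge)
```

Here only `good` survives: `repetitive` reaches a correct answer but contains a low-rated step, which is exactly the kind of trajectory that outcome-only filtering would retain. In an iterative setup, the kept trajectories would be used to fine-tune the agent, which is then resampled and filtered again.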