SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents

📅 2026-02-08

🤖 AI Summary

This work addresses a limitation of existing search agents: they are trained predominantly with outcome-based supervision and lack fine-grained evaluation of intermediate reasoning and search steps. The authors propose SRR-Judge, a framework for reliable step-level rating and refinement of reasoning and search actions. Its ratings correlate strongly with final answer correctness and are more reliable than those of much larger models such as DeepSeek-V3.1. SRR-Judge is integrated into a modified ReAct-style rate-and-refine workflow that supports efficient post-training annotation, and the annotated trajectories are used for iterative rejection sampling fine-tuning of the base agent. Across multiple deep search benchmarks, the approach yields an average absolute pass@1 improvement of more than 10 percentage points.
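One way to picture the data-selection half of rejection sampling fine-tuning is the filter below. It is a minimal sketch, not the authors' procedure: the `Trajectory` fields, the [0, 1] score scale, the `min_step_score` threshold, and the rule of requiring both a correct answer and acceptable step ratings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """A sampled agent rollout (hypothetical structure)."""
    question: str
    step_scores: List[float]  # assumed per-step judge ratings in [0, 1]
    answer_correct: bool      # outcome check against a reference answer


def accept(traj: Trajectory, min_step_score: float = 0.5) -> bool:
    """Keep a rollout only if the answer is correct and no step is rated too low.

    Combining an outcome check with a step-level check is an assumption here.
    """
    if not traj.step_scores:
        return False
    return traj.answer_correct and min(traj.step_scores) >= min_step_score


def select_for_finetuning(samples: List[Trajectory],
                          min_step_score: float = 0.5) -> List[Trajectory]:
    """Rejection step: discard rollouts that fail the acceptance rule."""
    return [t for t in samples if accept(t, min_step_score)]
```

In an iterative setup, the accepted rollouts would be used to fine-tune the agent, and the updated agent would then be sampled again for the next round.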

📝 Abstract

Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.
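The rate-and-refine workflow described in the abstract can be sketched as a loop in which each proposed step is rated and, if rated poorly, replaced by a refined version before the agent continues. This is only an illustrative sketch under stated assumptions: the `propose` and `judge` callables, the score threshold, the `SEARCH:`/`ANSWER:` step prefixes, and the toy stubs are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

Step = str
# Hypothetical interfaces: a policy proposes the next step; a judge returns
# (rating, refined_step) for a proposed step given the history so far.
Policy = Callable[[str, List[Step]], Step]
Judge = Callable[[str, List[Step], Step], Tuple[float, Step]]


def rate_and_refine(question: str, propose: Policy, judge: Judge,
                    max_steps: int = 4,
                    threshold: float = 0.5) -> Tuple[List[Step], List[float]]:
    """Run a ReAct-style loop where every step is rated and possibly refined."""
    history: List[Step] = []
    ratings: List[float] = []
    for _ in range(max_steps):
        step = propose(question, history)
        score, refined = judge(question, history, step)
        if score < threshold:
            step = refined  # swap in the judge's refinement for a weak step
        history.append(step)
        ratings.append(score)  # rating of the originally proposed step
        if step.startswith("ANSWER:"):
            break
    return history, ratings


# Toy stubs so the sketch runs end to end (assumed behavior, not real models).
def toy_policy(question: str, history: List[Step]) -> Step:
    return "SEARCH: vague query" if not history else "ANSWER: 42"


def toy_judge(question: str, history: List[Step], step: Step) -> Tuple[float, Step]:
    if "vague" in step:
        return 0.2, "SEARCH: " + question
    return 0.9, step


history, ratings = rate_and_refine("capital of France", toy_policy, toy_judge)
```

The recorded ratings are the kind of step-level annotation that could later feed trajectory selection for fine-tuning.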

Problem

Research questions and friction points this paper is trying to address.

search-integrated reasoning

step-level evaluation

deep search agents

intermediate reasoning quality

outcome-based supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

step-level evaluation

search-integrated reasoning

rate-and-refine