🤖 AI Summary
This work addresses the multi-scale credit assignment challenge in reinforcement learning for retrieval-augmented reasoning, where sparse trajectory-level rewards fail to distinguish high-quality reasoning from answers that are correct by chance. To this end, the authors propose an Actor-Refiner collaborative framework that decomposes reasoning into an Actor module, which generates initial reasoning trajectories, and a Meta-Refiner module, which selectively diagnoses and corrects erroneous steps. The approach introduces a 'cut-and-regenerate' refinement mechanism and a fine-grained hybrid reward combining answer correctness with the information density of retrieved evidence. Theoretical analysis and experiments demonstrate that the method significantly outperforms existing RAG and reinforcement learning baselines across multiple general and multi-hop question answering benchmarks, achieving higher reasoning accuracy across different model scales with minimal computational overhead.
📄 Abstract
Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a 'cut-and-regenerate' mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.
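The hybrid reward described above can be illustrated with a minimal sketch: a sparse outcome term (answer correctness) is blended with a dense process term measuring how much answer-relevant information the retrieved evidence carries. All function names, the keyword-coverage proxy for information density, and the mixing weight `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a hybrid reward: sparse outcome correctness
# plus a dense process signal based on evidence information density.
# The keyword-coverage density proxy and alpha weight are assumptions.

def information_density(evidence: str, answer_keywords: list[str]) -> float:
    """Toy proxy: fraction of answer-relevant keywords found in the evidence."""
    if not answer_keywords:
        return 0.0
    tokens = evidence.lower().split()
    hits = sum(1 for kw in answer_keywords if kw.lower() in tokens)
    return hits / len(answer_keywords)

def hybrid_reward(predicted: str, gold: str, evidence_per_step: list[str],
                  answer_keywords: list[str], alpha: float = 0.7) -> float:
    """Couple the sparse outcome reward with a dense per-step process reward."""
    outcome = 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0
    if evidence_per_step:
        # Average evidence density across all retrieval steps in the trajectory.
        process = sum(information_density(e, answer_keywords)
                      for e in evidence_per_step) / len(evidence_per_step)
    else:
        process = 0.0
    return alpha * outcome + (1.0 - alpha) * process
```

Under this sketch, a trajectory that guesses the right answer with uninformative retrievals earns less than one whose evidence actually covers the answer, which is the fine-grained distinction the sparse trajectory-level reward cannot make.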