Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are limited by their static knowledge, and existing retrieval-augmented methods often introduce noisy or irrelevant information that degrades reasoning accuracy. To address this, the paper proposes AutoRefine, a reinforcement learning post-training framework built on a "search-and-refine-during-think" paradigm in which the model autonomously and iteratively retrieves knowledge, filters noise, distills information, and organizes evidence. The key contributions are: (1) a dynamic reasoning workflow that alternates between search and refinement steps; (2) a composite reward that explicitly models retrieval quality alongside answer correctness, optimized with Group Relative Policy Optimization (GRPO) for stable policy learning; and (3) a dynamic evidence synthesis mechanism that adaptively aggregates and structures retrieved content. Evaluated on single-hop and multi-hop question answering benchmarks, the method significantly outperforms state-of-the-art approaches, with substantial gains in multi-hop accuracy, while issuing more frequent, higher-quality retrievals and integrating evidence more effectively.
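The alternating rollout the summary describes can be pictured as a short generation loop. The sketch below is an illustration only, not the authors' implementation: the `<search>`/`<refine>`/`<answer>` tag names, the `generate` and `retrieve` callables, and the `max_searches` cap are all assumed for the example.

```python
import re

def extract_tag(text, tag):
    """Return the content of the last <tag>...</tag> span, or None."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def autorefine_rollout(generate, retrieve, question, max_searches=4):
    """One hypothetical search-and-refine-during-think trajectory.

    `generate(prompt, stop)` continues the text until a stop tag and
    `retrieve(query)` returns a list of document strings; both are
    caller-supplied stand-ins for the policy LLM and the search engine.
    """
    traj = f"Question: {question}\n<think>"
    for _ in range(max_searches):
        # Reason until the model either issues a search or commits to an answer.
        traj += generate(traj, stop=["</search>", "</answer>"])
        if extract_tag(traj, "answer") is not None:
            break
        query = extract_tag(traj, "search")
        docs = retrieve(query)
        # Explicit refinement step: distill the raw documents into a short
        # evidence note before the next search call or the final answer.
        traj += f"\n<documents>{' '.join(docs)}</documents>\n<refine>"
        traj += generate(traj, stop=["</refine>"])
    return extract_tag(traj, "answer"), traj
```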

📝 Abstract
Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
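To make the reward design concrete, here is a minimal sketch of a composite reward in the spirit the abstract describes: one answer-correctness term plus one retrieval-specific term scored on the refined evidence. The `beta` weight and the exact-match/substring scoring are illustrative assumptions; the paper only states that tailored retrieval rewards accompany answer-correctness rewards.

```python
def _normalize(text):
    # Lowercase and collapse whitespace for a forgiving string comparison.
    return " ".join(text.lower().split())

def composite_reward(pred_answer, gold_answers, refined_evidence, beta=0.5):
    # Answer-correctness term: exact match against any reference answer.
    correct = float(any(_normalize(pred_answer) == _normalize(g)
                        for g in gold_answers))
    # Retrieval-specific term: credit trajectories whose distilled evidence
    # contains a reference answer, even when the final answer is wrong, so
    # the policy is rewarded for useful search-and-refine behavior.
    evidence_hit = float(any(_normalize(g) in _normalize(refined_evidence)
                             for g in gold_answers))
    return correct + beta * evidence_hit
```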
Problem

Research questions and friction points this paper is trying to address.

LLMs' parametric knowledge is static, limiting reasoning that depends on external or up-to-date facts
Retrieval often returns irrelevant or noisy documents that degrade reasoning accuracy
Multi-hop question answering requires gathering and integrating evidence across several retrieval steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning post-training framework (AutoRefine)
"Search-and-refine-during-think" paradigm with explicit refinement steps between successive searches
Retrieval-specific rewards combined with answer-correctness rewards via group relative policy optimization (GRPO; sketched below)
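The last bullet refers to group relative policy optimization, which replaces a learned value model with per-group reward statistics. A minimal sketch of the group-relative advantage computation, under standard GRPO assumptions (the surrounding clipped policy-gradient update is omitted):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    # GRPO dispenses with a learned critic: for the G responses sampled
    # from the same question, each reward is normalized against the group
    # mean and standard deviation to yield a relative advantage.
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

With a composite reward like the one sketched above, e.g. `grpo_advantages([1.5, 1.0, 0.5, 0.0])`, trajectories that both answer correctly and surface good evidence receive positive advantages, while the rest are pushed down.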
👥 Authors
Yaorui Shi, University of Science and Technology of China (Large Language Model)
Shihan Li, University of Science and Technology of China
Chang Wu, University of Science and Technology of China
Zhiyuan Liu, National University of Singapore
Junfeng Fang, National University of Singapore (Model Editing, AI Safety, LLM Explainability, AI4Science)
Hengxing Cai, Sun Yat-sen University (LLM, VLM, VLN, UAV)
An Zhang, University of Science and Technology of China (Generative Models, Trustworthy AI, Agentic AI, Recommender System)
Xiang Wang, University of Science and Technology of China