🤖 AI Summary
Large language models (LLMs) suffer from static knowledge bases, and existing retrieval-augmented methods often introduce noisy or irrelevant information, degrading reasoning accuracy. To address this, we propose AutoRefine, a reinforcement learning–driven framework built on a "search-and-refine-during-think" paradigm for autonomous, iterative knowledge retrieval, noise filtering, information distillation, and structured evidence organization. Our key contributions are: (1) a dynamic reasoning workflow that interleaves explicit refinement steps between successive search calls; (2) a composite reward function that explicitly models retrieval quality alongside answer correctness, coupled with Group Relative Policy Optimization (GRPO) for stable policy learning; and (3) a dynamic evidence synthesis mechanism that adaptively aggregates and structures retrieved content. Evaluated on both single-hop and multi-hop question answering benchmarks, our method significantly outperforms state-of-the-art approaches—achieving substantial gains in multi-hop accuracy—while issuing more frequent, higher-quality retrievals and integrating evidence more effectively.
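The composite reward and GRPO training signal described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the 0.5 retrieval-reward weighting, and the reward inputs are assumptions made for the example.

```python
def composite_reward(answer_correct: bool, retrieval_quality: float) -> float:
    """Combine an outcome (answer-correctness) reward with a
    retrieval-specific reward. The 0.5 weighting is illustrative."""
    return float(answer_correct) + 0.5 * retrieval_quality

def grpo_advantages(rewards):
    """GRPO computes group-relative advantages: each sampled rollout's
    reward is normalized against its sampling group,
    A_i = (r_i - mean(group)) / std(group)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard: all-equal rewards give std = 0
    return [(r - mean) / std for r in rewards]

# Four hypothetical rollouts for one question:
# (answer correct?, retrieval-quality score in [0, 1])
group = [(True, 0.8), (False, 0.6), (True, 0.2), (False, 0.0)]
rewards = [composite_reward(c, q) for c, q in group]
advantages = grpo_advantages(rewards)
```

Because advantages are normalized within each group, rollouts that both answer correctly and retrieve well are pushed up relative to their peers without needing a learned value model.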
📝 Abstract
Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their static knowledge. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer-correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.