Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

📅 2025-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of enabling large language models (LLMs) to flexibly and autonomously acquire external knowledge during multi-step reasoning, this paper proposes a purely reinforcement learning (RL)-driven retrieval-augmented reasooning framework. Unlike prior approaches, it eliminates reliance on supervised data and hand-crafted tool-use logic, allowing the LLM to autonomously generate iterative search queries, invoke search engines in real time, and synthesize answers from retrieved content. Key contributions include: (1) the first end-to-end RL paradigm for joint multi-step retrieval and reasoning; (2) a retrieved-token masking mechanism that enhances training stability; and (3) a lightweight, outcome-based reward function. The method achieves significant improvements over state-of-the-art baselines across seven open-domain QA benchmarks: +26% with Qwen2.5-7B, +21% with Qwen2.5-3B, and +10% with LLaMA3.2-3B. The code and models are publicly released.
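The retrieved-token masking mechanism from contribution (2) can be sketched as follows: tokens injected by the retriever are excluded from the RL loss, so the policy gradient only flows through tokens the model itself generated. This is a minimal illustration assuming the paper's tag convention of wrapping retrieved passages in `<information>...</information>` markers; whether the tag tokens themselves are masked is an assumption here.

```python
def loss_mask(tokens, open_tag="<information>", close_tag="</information>"):
    """Per-token loss mask for RL training: 1 for model-generated tokens
    (included in the loss), 0 for environment-inserted retrieved tokens
    inside <information>...</information> blocks, which are excluded
    to stabilize training."""
    mask, inside = [], False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(0)  # treat the tag itself as environment-inserted
        elif tok == close_tag:
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask
```

In a real training loop this mask would be multiplied element-wise into the per-token policy-gradient loss, so retrieved passages contribute context but no gradient.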

📝 Abstract
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Retrieval augmentation and tool-use training approaches where a search engine is treated as a tool lack complex multi-turn retrieval flexibility or require large-scale supervised data. Prompting advanced LLMs with reasoning capabilities during inference to use search engines is not optimal, since the LLM does not learn how to optimally interact with the search engine. This paper introduces Search-R1, an extension of the DeepSeek-R1 model where the LLM learns -- solely through reinforcement learning (RL) -- to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over SOTA baselines. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
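The "simple outcome-based reward function" mentioned in the abstract scores a rollout only by its final answer, with no intermediate process rewards. A minimal sketch, assuming an exact-match criterion against gold answers; the exact string normalization the paper applies is an assumption here.

```python
def outcome_reward(prediction, gold_answers):
    """Outcome-based reward sketch: 1.0 if the predicted answer exactly
    matches any gold answer after light normalization (lowercase,
    collapsed whitespace), else 0.0. No per-step shaping is used."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return 1.0 if norm(prediction) in {norm(g) for g in gold_answers} else 0.0
```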
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM reasoning with real-time search engine interaction.
Improve multi-turn retrieval flexibility without supervised data.
Optimize LLM performance using reinforcement learning techniques.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning trains LLMs to generate and refine search queries.
Multi-turn search interactions enhance reasoning flexibility.
Retrieved token masking stabilizes RL training process.
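The multi-turn interaction loop behind the innovations above can be sketched as follows: the model reasons, optionally emits a `<search>` query, the environment injects results as an `<information>` block, and the rollout stops once an `<answer>` appears. `generate` and `search` are hypothetical stubs standing in for the LLM and the search engine, not the paper's actual API.

```python
import re

def rollout(generate, search, question, max_turns=4):
    """Sketch of a Search-R1-style multi-turn rollout. Each turn, the
    model extends the trajectory; <search>q</search> triggers retrieval
    whose results are appended as <information>...</information>, and
    <answer>...</answer> terminates the episode."""
    trajectory = question
    for _ in range(max_turns):
        step = generate(trajectory)
        trajectory += step
        answer = re.search(r"<answer>(.*?)</answer>", step, re.S)
        if answer:
            return answer.group(1).strip(), trajectory
        query = re.search(r"<search>(.*?)</search>", step, re.S)
        if query:
            docs = search(query.group(1).strip())
            trajectory += f"<information>{docs}</information>"
    return None, trajectory  # no answer within the turn budget
```

During RL training, each completed trajectory would be scored by the outcome reward, with the injected `<information>` spans masked out of the loss.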