🤖 AI Summary
To close the widening performance gap between proprietary search AI systems (e.g., Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview) and open-source alternatives, this paper introduces Open Deep Search (ODS), a plug-and-play framework that equips any open-source LLM with real-time web retrieval and multi-step reasoning capabilities. Methodologically, ODS pairs a user-chosen base LLM with two collaborating components: Open Reasoning Agent, which interprets a task and orchestrates a sequence of actions (dynamic tool invocation, task decomposition, and result aggregation), and Open Search Tool, a novel open-source web search tool that outperforms proprietary counterparts. Instantiated with open-source reasoning models such as DeepSeek-R1, ODS achieves state-of-the-art results on SimpleQA (88.3%) and FRAMES (75.3%), improving on the base DeepSeek-R1 by 5.9% on SimpleQA and surpassing GPT-4o Search Preview on FRAMES by 9.7% in accuracy.
📝 Abstract
We introduce Open Deep Search (ODS) to close the increasing gap between proprietary search AI solutions, such as Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview, and their open-source counterparts. The main innovation introduced in ODS is to augment the reasoning capabilities of the latest open-source LLMs with reasoning agents that can judiciously use web search tools to answer queries. Concretely, ODS consists of two components that work with a base LLM chosen by the user: Open Search Tool and Open Reasoning Agent. Open Reasoning Agent interprets the given task and completes it by orchestrating a sequence of actions that includes calling tools, one of which is the Open Search Tool. Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Together with powerful open-source reasoning LLMs, such as DeepSeek-R1, ODS nearly matches, and sometimes surpasses, the existing state-of-the-art baselines on two benchmarks: SimpleQA and FRAMES. For example, on the FRAMES evaluation benchmark, ODS improves on the best existing baseline, the recently released GPT-4o Search Preview, by 9.7% in accuracy. ODS is a general framework for seamlessly augmenting any LLM -- for example, DeepSeek-R1, which achieves 82.4% on SimpleQA and 30.1% on FRAMES -- with search and reasoning capabilities to achieve state-of-the-art performance: 88.3% on SimpleQA and 75.3% on FRAMES.
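The agent-plus-tool control flow described above (interpret the task, decide whether to search, invoke the tool, aggregate results into an answer) can be sketched in a few lines. This is a minimal illustration only: the names `ReasoningAgent` and `stub_search_tool` are hypothetical stand-ins, and where ODS would consult an LLM and a live web search backend, this sketch uses a trivial heuristic and a canned corpus so the orchestration loop itself stays visible.

```python
from dataclasses import dataclass
from typing import Callable

def stub_search_tool(query: str) -> list[str]:
    """Hypothetical stand-in for a web search tool: returns snippets.
    A real backend would issue a live web query here."""
    corpus = {
        "capital of france": ["Paris is the capital of France."],
    }
    return corpus.get(query.lower(), [])

@dataclass
class ReasoningAgent:
    """Minimal sketch of an agent loop: decide whether to search,
    invoke the pluggable search tool, then aggregate retrieved
    context into an answer."""
    search: Callable[[str], list[str]]

    def needs_search(self, query: str) -> bool:
        # In ODS the reasoning LLM judges this; a fixed heuristic
        # stands in for that judgment here.
        return True

    def answer(self, query: str) -> str:
        context: list[str] = []
        if self.needs_search(query):
            context = self.search(query)  # dynamic tool invocation
        if context:
            return context[0]  # trivial result aggregation
        return "No evidence found; answering from parametric knowledge."

# Plug-and-play: any search backend with the same signature can be swapped in.
agent = ReasoningAgent(search=stub_search_tool)
print(agent.answer("capital of france"))  # → Paris is the capital of France.
```

The design point the sketch preserves is that the agent and the search tool are decoupled behind a plain callable interface, which is what lets the framework wrap an arbitrary base LLM and an arbitrary search backend.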