🤖 AI Summary
This study investigates whether a well-tuned lexical retriever (BM25) remains sufficient to support effective deep research now that large language models (LLMs) have strong reasoning and tool-use capabilities. To this end, we introduce Pi-Serini, a search agent with three tools for retrieving, browsing, and reading documents, which pairs a frontier LLM (e.g., gpt-5.5) with a carefully tuned BM25 retriever. Evaluated on BrowseComp-Plus, our system achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. The results suggest that a capable LLM paired with a well-configured BM25 retriever and sufficient retrieval depth can support deep research, which challenges the assumption that complex retrieval architectures are necessary and reaffirms the potential of lexical retrieval in this setting.
📝 Abstract
Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at https://github.com/justram/pi-serini.
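To make "BM25 tuning" and "retrieval depth" concrete, the following is a minimal, self-contained sketch and not the Pi-Serini implementation. It assumes the tuned quantities are the standard BM25 parameters `k1` (term-frequency saturation) and `b` (length normalization), which the abstract does not state, and it uses a simplified notion of evidence recall (the fraction of gold evidence documents in the top-k results of a single query), which may differ from the paper's definition of surfaced evidence recall. The toy corpus, the default parameter values, and all function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    k1 and b are illustrative defaults, not the settings used in the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Lucene-style IDF, which is always non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls term-frequency saturation; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(scores, k):
    """Return indices of the k highest-scoring documents (ties broken by index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def evidence_recall(retrieved, gold):
    """Fraction of gold evidence documents present in the retrieved set."""
    return len(set(retrieved) & set(gold)) / len(gold)


# Hypothetical toy corpus: documents 0 and 2 are the gold evidence.
docs = [
    "bm25 is a lexical ranking function".split(),
    "dense retrievers embed queries and documents".split(),
    "tuning bm25 parameters k1 and b changes ranking".split(),
    "agents call search tools in a loop".split(),
]
query = "bm25 ranking".split()
gold = [0, 2]

scores = bm25_scores(query, docs)
shallow = evidence_recall(top_k(scores, 1), gold)  # depth 1
deep = evidence_recall(top_k(scores, 2), gold)     # depth 2
```

In this toy setup, depth 1 surfaces only one of the two gold documents, while depth 2 surfaces both, which mirrors in miniature why deeper retrieval can raise evidence recall. Setting `b=0` removes length normalization, so the two matching documents score identically; sweeping `k1` and `b` against a recall or accuracy metric is one plausible way to tune a BM25 retriever.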